gsub: invalid byte sequence in US-ASCII

Discussion in 'Ruby' started by R.. Kumar, Jun 15, 2010.

  1. R.. Kumar

    R.. Kumar Guest

    I download the page http://www.ruby-forum.com/forum/4 using wget. Then I
    cat the file and pipe it to gsub.

    I get: -e:1:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)


    wget -q -k -O index11.html http://www.ruby-forum.com/forum/4

    cat index11.html | ruby -pe 'gsub(/href=a\/"/,"href=\"'${base}'")' > ofile

    (The value of base is http://www.ruby-forum.com/)

    So what must I do so that this command can run? It runs fine with
    another site.
    If I replace ruby with perl -pe 's|....|g', that works fine.

    I actually run this in a loop with various URLS from cron.
    --
    Posted via http://www.ruby-forum.com/.
     
    R.. Kumar, Jun 15, 2010
    #1
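The error in the first post can be reproduced without wget. The following is only a minimal sketch: the sample line and the `href="/` pattern are illustrative assumptions, not the poster's exact data. It tags a string containing a non-ASCII byte as US-ASCII, which is roughly what Ruby 1.9 and later do to piped input when the environment gives them no better locale, and then calls gsub on it.

```ruby
# Illustrative input: the byte 0xE9 is not a valid US-ASCII character.
line = "<a href=\"/forum/4\">caf\xE9</a>".dup.force_encoding("US-ASCII")

begin
  line.gsub(/href="\//, 'href="http://www.ruby-forum.com/')
  error_message = nil
rescue ArgumentError => e
  # The regexp engine refuses to scan a string that is invalid in its own encoding.
  error_message = e.message
end

puts line.valid_encoding?
puts error_message
```

A page that happens to be pure ASCII never hits this path, which would be consistent with the command running fine against another site.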

  2. R.. Kumar wrote:
    > If i replace ruby with perl -pe 's|....|g' that works fine.


    Replacing ruby 1.9.x with ruby 1.8.x is just as effective, and I would
    recommend this for maintaining your sanity.

    I can only guess that the external encoding picked up from your
    platform's environment is US-ASCII (are you using cygwin by any chance?)

    You probably need to set the external encoding to UTF-8 or BINARY for
    your regexp not to crash. Try adding -Ku or -Kn to your ruby command
    line.

    If you want to attempt to understand String encoding in ruby 1.9, then
    good luck to you. I tried, documented what I found here:
    http://github.com/candlerb/string19/blob/master/string19.rb
    and gave up after about 200 rules. There is no official documentation.
     
    Brian Candler, Jun 15, 2010
    #2
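To see why the external encoding matters here, it may help to look at the same bytes under two different labels. This is a sketch with made-up sample data; the `href="/` pattern is an assumption about what the original command was meant to match.

```ruby
# Illustrative bytes: "café" encoded as UTF-8, followed by a relative link.
raw = "caf\xC3\xA9 href=\"/forum/4\"".b

as_ascii = raw.dup.force_encoding(Encoding::US_ASCII)
as_utf8  = raw.dup.force_encoding(Encoding::UTF_8)

# What this Ruby process derived from its environment or command-line flags.
puts Encoding.default_external

puts as_ascii.valid_encoding?   # the two-byte sequence is not valid US-ASCII
puts as_utf8.valid_encoding?    # the same bytes are valid UTF-8

# gsub works once the label matches the bytes.
fixed = as_utf8.gsub(/href="\//, 'href="http://www.ruby-forum.com/')
puts fixed
```

Labelling input as UTF-8 only helps when the bytes really are valid UTF-8; a page in another charset would still raise, which is where the BINARY suggestion comes in.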

  3. On 6/15/10, R.. Kumar wrote:
    > I download the page http://www.ruby-forum.com/forum/4 using wget. Then i
    > cat the file and pipe to gsub.
    >
    > I get: -e:1:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
    >
    >
    > wget -q -k -O index11.html http://www.ruby-forum.com/forum/4
    >
    > cat index11.html | ruby -pe 'gsub(/href=a\/"/,"href=\"'${base}'")' >
    > ofile
    >
    > (The value of base is http://www.ruby-forum.com/)
    >
    > So what must i do so this command can run. It runs fine with another
    > site.
    > If i replace ruby with perl -pe 's|....|g' that works fine.
    >
    > I actually run this in a loop with various URLS from cron.


    Handling this kind of thing right means tracking encodings right....
    which means you'd have to extract the encoding from the http session
    and then mark the input as that encoding in your ruby script... and
    then deal with the inevitable incompatible encoding errors that would
    crop up.

    It sounds to me, tho, like in this case what you have is just some
    hacky little scripts and it would be acceptable for them to be
    imperfect. So, in that case, I suggest trying to set the encoding for
    your source file(s) to BINARY. That's a hack, but it ought to be
    effective.

    Alternately, you could drop back to the 1.8 interpreter, like Brian
    suggests, which more or less uses BINARY as the default source
    encoding.
     
    Caleb Clausen, Jun 15, 2010
    #3
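The two approaches described in this post can be combined in one place. The sketch below assumes the Content-Type header value has already been obtained from the HTTP response; `charset_from` and `tag_body` are hypothetical helper names, and the sample bytes are invented. It tags the body with the declared charset when there is one and falls back to BINARY otherwise.

```ruby
# Hypothetical helper: extract the charset from a Content-Type header value.
# Returns an Encoding, or nil when no usable charset is declared.
def charset_from(content_type)
  name = content_type.to_s[/charset=["']?([^\s;"']+)/i, 1]
  name && Encoding.find(name)
rescue ArgumentError
  nil # the header named an encoding Ruby does not know
end

# Hypothetical helper: label raw bytes with the declared encoding, falling
# back to BINARY when nothing is declared or the bytes do not match it.
def tag_body(bytes, content_type)
  body = bytes.dup.force_encoding(charset_from(content_type) || Encoding::BINARY)
  body.valid_encoding? ? body : body.force_encoding(Encoding::BINARY)
end

sample = "caf\xE9 <a href=\"/x\">".b   # illustrative Latin-1 bytes

page = tag_body(sample, "text/html; charset=ISO-8859-1")
blob = tag_body(sample, "text/html")

puts page.encoding
puts blob.encoding

# A BINARY string is always "valid", so an ASCII-only regexp can scan it.
rewritten = blob.gsub(/href="\//, 'href="http://www.ruby-forum.com/')
puts rewritten.bytesize
```

The BINARY fallback keeps gsub from raising, at the cost of treating the text as raw bytes, so patterns containing non-ASCII characters would not be usable on it.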
  4. Bill Kelly

    Bill Kelly Guest

    Caleb Clausen wrote:
    >
    > Handling this kind of thing right means tracking encodings right....
    > which means you'd have to extract the encoding from the http session
    > and then mark the input as that encoding in your ruby script... and
    > then deal with the inevitable incompatible encoding errors that would
    > crop up.
    >
    > It sounds to me, tho, like in this case what you have is just some
    > hacky little scripts and it would be acceptable for them to be
    > imperfect. So, in that case, I suggest trying to set the encoding for
    > your source file(s) to BINARY. That's a hack, but it ought to be
    > effective.


    Additional info on the source, external, and internal encodings:

    http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings


    For the OP, I'd expect `ruby -EBINARY ...` or `ruby -EASCII-8BIT ...`
    should work.


    Regards,

    Bill
     
    Bill Kelly, Jun 15, 2010
    #4
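If changing the command line is inconvenient, roughly the same effect as `-EBINARY` can be had inside a script by setting the encoding on the input stream. This is a sketch only: a StringIO with invented bytes stands in for the piped file, and the `href="/` pattern is an assumption about the intended match.

```ruby
require "stringio"

# Stand-in for the piped file: 0xE9 is valid in neither US-ASCII nor UTF-8 here.
input = StringIO.new("<a href=\"/topic/1\">caf\xE9</a>\n".b)

# In a real script this would be: $stdin.set_encoding(Encoding::BINARY)
input.set_encoding(Encoding::BINARY)

base = "http://www.ruby-forum.com/"

# Each line comes back tagged BINARY, so the ASCII-only regexp never raises.
output = input.each_line.map do |line|
  line.gsub(/href="\//) { "href=\"#{base}" }
end

print output.join
```

The non-ASCII byte passes through untouched, which is usually what a link-rewriting filter wants.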
  5. R.. Kumar

    R.. Kumar Guest

    Brian Candler wrote:
    > R.. Kumar wrote:
    >> If i replace ruby with perl -pe 's|....|g' that works fine.

    >
    > Replacing ruby 1.9.x with ruby 1.8.x is just as effective, and I would
    > recommend this for maintaining your sanity.
    >
    > I can only guess that the external encoding picked up from your
    > platform's environment is US-ASCII (are you using cygwin by any chance?)
    >
    > You probably need to set the external encoding to UTF-8 or BINARY for
    > your regexp not to crash. Try adding -Ku or -Kn to your ruby command
    > line.
    >
    > If you want to attempt to understand String encoding in ruby 1.9, then
    > good luck to you. I tried, documented what I found here:
    > http://github.com/candlerb/string19/blob/master/string19.rb
    > and gave up after about 200 rules. There is no official documentation.


    1. I have moved to 1.9 long back. Don't want to move back.

    2. I am on OSX. I think I had _probably_ (?) solved this issue on my
    previous laptop (PPC) -- now I've migrated my user to a new machine
    (Snow Leopard). All my settings should have moved. (I say this since I
    had commented out the perl line).

    LC_ALL=en_US.UTF-8
    LC_CTYPE=en_US.UTF-8
    LANG=C

    3. Thanks for the link, I will read it. But NO, I have already read up
    enough a few months back, and do not have the energy to do it again :-(.

    Thanks for the tip on -Ku / -Kn
     
    R.. Kumar, Jun 16, 2010
    #5
  6. R.. Kumar

    R.. Kumar Guest

    Brian Candler wrote:

    >
    > I can only guess that the external encoding picked up from your
    > platform's environment is US-ASCII (are you using cygwin by any chance?)
    >
    > You probably need to set the external encoding to UTF-8 or BINARY for
    > your regexp not to crash. Try adding -Ku or -Kn to your ruby command
    > line.


    Ok, I've got it. The problem occurred when the program was run by cron.
    My user setting is UTF and it ran fine in terminal. So now in the
    program itself I have set LC_CTYPE and LC_ALL to en_US.UTF-8. Hopefully,
    it should work fine now.

    Thanks.
     
    R.. Kumar, Jun 16, 2010
    #6
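The cron-versus-terminal difference can be checked directly. The sketch below starts a fresh Ruby process under a given LC_ALL and reports which external encoding it picks; `external_encoding_under` is a hypothetical helper name. The result depends on the platform, the Ruby version, and whether the named locale is installed, so no particular output is claimed here.

```ruby
require "rbconfig"

# Hypothetical helper: ask a child Ruby process which external encoding it
# derives when started with the given LC_ALL value.
def external_encoding_under(lc_all)
  IO.popen({ "LC_ALL" => lc_all },
           [RbConfig.ruby, "-e", "print Encoding.default_external.name"],
           &:read)
end

cron_like = external_encoding_under("C")            # what a bare cron job often sees
utf8_like = external_encoding_under("en_US.UTF-8")  # the terminal setting from the thread

puts "LC_ALL=C           -> #{cron_like}"
puts "LC_ALL=en_US.UTF-8 -> #{utf8_like}"
```

Ruby reads the locale once at startup, so the variables have to be set before the interpreter launches, for example in the crontab or in the wrapper script that invokes ruby.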
