invalid byte sequence in US-ASCII (ArgumentError)

Luther · Feb 16, 2009

I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
code:

text.gsub! "\C-m", ''

...which generates this error:

/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
US-ASCII (ArgumentError)

The purpose is to strip out any ^M characters from the string. I've
tried a couple of different magic comments with utf-8, but the error
message still shows the same "US-ASCII". I also tried changing the \C-m
to 13.chr, but I still got the same error, suggesting that control
characters aren't even allowed in strings anymore.

I'm sure this must be a common migration problem, but I can't find a
solution no matter how hard I search the web. Any help would be greatly
appreciated.

Luther

Tim Hunter · Feb 16, 2009

Luther said:
I'm having some trouble migrating from 1.8 to 1.9.1. I have this line of
code:

text.gsub! "\C-m", ''

...which generates this error:

/home/luther/bin/dos2gnu:16:in `gsub!': invalid byte sequence in
US-ASCII (ArgumentError)

The purpose is to strip out any ^M characters from the string. I've
tried a couple of different magic comments with utf-8, but the error
message still shows the same "US-ASCII". I also tried changing the \C-m
to 13.chr, but I still got the same error, suggesting that control
characters aren't even allowed in strings anymore.

I'm sure this must be a common migration problem, but I can't find a
solution no matter how hard I search the web. Any help would be greatly
appreciated.

Luther

Since Ruby is claiming the source file is US-ASCII it seems likely that
it's not noticing the magic comment. Make sure your magic comment is the
first line in the script, or if you're using a shebang line, the second
line. That is, either

# encoding: utf-8

or

#! /usr/local/bin/ruby
# encoding: utf-8

Luther · Feb 16, 2009

Since Ruby is claiming the source file is US-ASCII it seems likely that
it's not noticing the magic comment. Make sure your magic comment is the
first line in the script, or if you're using a shebang line, the second
line. That is, either

# encoding: utf-8

or

#! /usr/local/bin/ruby
# encoding: utf-8

I put the encoding line right after my shebang line, but it had no
effect.

In further investigation, I tried running my program on a different text
file, and it worked fine. The original text file had some very odd
characters at the beginning and the end of the file. Once I deleted that
metadata, my program worked fine.

This means the problem was with the "text" variable rather than the
arguments. This seems very wrong to me since it threw an ArgumentError.
Or maybe I don't know anything about exceptions.

So, my problem is partially solved, but now I know my program will puke
on any text file with multibyte characters.

Luther

Tom Link · Feb 16, 2009

but now I know my program will puke

on any text file with multibyte characters.

Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In you case, you should probably read the files as ASCII8BIT
or binary, I guess.

Brian Candler · Feb 16, 2009

Tom said:
Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In you case, you should probably read the files as ASCII8BIT
or binary, I guess.

Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

So if you deal with data which is not text (as I do all the time), you
need to put

File.open("....", :encoding => "BINARY")

everywhere. And even then, if you ask the open File object what it's
encoding is, it will say ASCII8BIT, even though you explicitly told it
that it's BINARY.

This is because "BINARY" is just a synonym for "ASCII8BIT" in ruby. Of
course, there is plenty of data out there which is not encoded using the
American Standard Code for Information Interchange. MIME distinguishes
clearly between 8BIT (text with high bit set) and BINARY (non-text). In
terms of Ruby's processing it makes no difference, but it's annoying for
Ruby to tell me that my data is text, when it is not.

Note: in more recent 1.9's, I believe that

File.open("....", "rb")

has the effect of doing two things:
1. Disabling line-ending translation under Windows
2. Setting encoding to ASCII8BIT

So this may be sufficient for your needs, and it has the advantage that
the same code will run under ruby <1.9.

Brian Candler · Feb 16, 2009

Brian said:
Yes. IMO this is a horrendous misfeature of ruby 1.9: it asserts that
all external data is text, unless explicitly told otherwise.

And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

I'm not saying that Ruby shouldn't handling encodings and conversions;
I'm just saying you should ask for them. For example:

File.open("....", :encoding => "UTF-8") # Use this encoding
File.open("....", :encoding => "ENV") # Follow the environment
File.open("....") # No idea, treat as binary

I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that's a pain.

Stefan Lang · Feb 16, 2009

Ruby must choose between treating all external data as
text unless told otherwise or treat everything as binary
unless told otherwise, because there is no general way
to know if a file is binary or text.

Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.

Also, if you open a file with the "b" flag, it sets the files
encoding to binary. You should use that flag in 1.8, too,
otherwise Windows will do line ending conversion, corrupting
your binary data.

And worse: the encoding chosen comes from the environment. So your
program which you developed on one system and runs correctly there may
fail totally on another.

It has to default to some encoding. Your OS installation has a
default encoding. It's a sane decision to use that, because
otherwise many scripts wouldn't work by default on your machine.

I'm not saying that Ruby shouldn't handling encodings and conversions;
I'm just saying you should ask for them. For example:

File.open("....", :encoding => "UTF-8") # Use this encoding

Well, you can do exactly that...

File.open("....", :encoding => "ENV") # Follow the environment

This is the default.

File.open("....") # No idea, treat as binary

Use "b" flag, which you should do on 1.8 anyway.

I'm not going to use 1.9 without wrapper scripts to invoke Ruby with
appropriate flags to force the external encoding to a fixed value. And
that's a pain.

You can set it with Encoding.default_external= at the top of your script.

Stefan

Brian Candler · Feb 16, 2009

Stefan said:
Ruby must choose between treating all external data as
text unless told otherwise or treat everything as binary
unless told otherwise, because there is no general way
to know if a file is binary or text.

Yes (and I wouldn't want it to try to guess)

Given that Ruby is mostly used to work with text, it's a
sensible decision to use text mode by default.

That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshall, DRB... Furthermore, as the OP
demonstrated, there are plenty of usage cases where files are presented
which are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don't want my programs to crash in these cases.

It has to default to some encoding.

That's where I also disagree. It can default to stream of bytes.

This is the default.

That's what I don't want. Given this default, I must either:

(1) Force all my source to have the correct encoding flag set
everywhere. If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check no exception is raised.

(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.

Encoding.default_external=

I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.

Stefan Lang · Feb 16, 2009

2009/2/16 Brian Candler said:
Yes (and I wouldn't want it to try to guess)

That's where I disagree. There are tons of non-text applications:
images, compression, PDFs, Marshall, DRB...

Point taken.

Furthermore, as the OP
demonstrated, there are plenty of usage cases where files are presented
which are almost ASCII, but not quite. The default behaviour now is to
crash, rather than to treat these as streams of bytes.

I don't want my programs to crash in these cases.

Let's compare. Situation: I'm reading binary files and
forget to specify the "b" flag when opening the file(s).

Result in Ruby 1.8:
* My stuff works fine on Linux/Unix. Somebody else runs
the script on Windows, the script corrupts data because
Windows does line ending conversion.

Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
I fix the problem by specifying the "b" flag on open. Done.

I definitely prefer Ruby 1.9 behavior.

That's where I also disagree. It can default to stream of bytes.

That's what I don't want. Given this default, I must either:

Assuming the default is always wrong.

(1) Force all my source to have the correct encoding flag set
everywhere.If I don't test for this, my programs will fail in
unexpected ways. Tests for this are awkward; they'd have to set the
environment to a certain locale (e.g. UTF-8), pass in data which is not
valid in that locale, and check no exception is raised.

(2) Use a wrapper script either to call Ruby with the correct
command-line flags, or to sanitise the environment.

I guess I can use that at the top of everything in bin/ directory. It
may be sufficient, but it's annoying to have to remember that too.

Here's why the default is good, IMO. The cases where I really
don't want to explicitly specify encodings is when I write one liners
(-e) and short throwaway scripts. If the default encoding were binary,
string operations would deal incorrectly with German (my native language)
accents. Using the locale encoding does the right thing here.

If I write a longer program, explicitly setting the default external
encoding isn't an effort worth mentioning. Set it to ASCII_8BIT
and it behaves like 1.8.

Stefan

Tom Link · Feb 16, 2009

Result in Ruby 1.8:

* My stuff works fine on Linux/Unix. Somebody else runs
=A0 the script on Windows, the script corrupts data because
=A0 Windows does line ending conversion.

Result in Ruby 1.9:
* On the first run on my Linux machine, I get an EncodingError.
=A0 =A0I fix the problem by specifying the "b" flag on open. Done.

Actually, this is a point I have never quite understood. Why does only
the windows version convert line endings? Is it out of question that
somebody could want to process a text file created under windows on a
linux box or virtual machine? Regular expressions that check only for
\n but not \r won't work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under windows? Or did I miss something
obvious?

This is also the reason why I think opening text files as binary isn't
really a solution. It leads to either convoluted regexps or non-
portable code. (Unless I missed something obvious, which is quite
possible.)

I personally find it somewhat confusing having to juggle with
different encodings. IMHO it would have been preferable to define a
fixed internal encoding (uft16 or whatever) and to transcode every
string that is known to be text and identifiers to that canonical/
uniform encoding and to deal with everything else as a sequence of
bytes.

BTW I recently skimmed through the python3000 user guide. From what I
understand, they seem to distinguish between strings as (binary) data
and strings as text (encoded as utf).

Gary Wright · Feb 16, 2009

Actually, this is a point I have never quite understood. Why does only
the windows version convert line endings?

Because at the operating system level Windows distinguishes between
text and binary files and Unix doesn't.

The "b" option has been part of the standard C library for decades and
Windows is not the only operating system that distinguishes between
text and binary files.

Proper handling of line termination requires the library to know if it
is working with a binary or text file. On Unix it doesn't matter if
you fail to give the library correct information (i.e. omit the b flag
for binary files) but your code becomes non-portable. It will fail on
systems that treat text and binary files differently.

Gary Wright

Stefan Lang · Feb 16, 2009

2009/2/16 Tom Link said:
Actually, this is a point I have never quite understood. Why does only
the windows version convert line endings? Is it out of question that
somebody could want to process a text file created under windows on a
linux box or virtual machine? Regular expressions that check only for
\n but not \r won't work then. Now you could of course take the stance
that you simply have to check for \r too, but then why automatically
convert line separators under windows? Or did I miss something
obvious?

It's the underlying C API that does the line ending conversion.
Ruby inherited that behavior.

This is also the reason why I think opening text files as binary isn't
really a solution. It leads to either convoluted regexps or non-
portable code. (Unless I missed something obvious, which is quite
possible.)

I personally find it somewhat confusing having to juggle with
different encodings. IMHO it would have been preferable to define a
fixed internal encoding (uft16 or whatever) and to transcode every
string that is known to be text and identifiers to that canonical/
uniform encoding and to deal with everything else as a sequence of
bytes.

Ruby does that when you set the internal encoding with
Encoding.default_internal=

BTW I recently skimmed through the python3000 user guide. From what I
understand, they seem to distinguish between strings as (binary) data
and strings as text (encoded as utf).

There were many and long discussions about the encoding API,
mostly on Ruby core. If you search the archives you can find
why the current API is how it is.

IIRC, these were important issues:

* We don't have a single internal string encoding (like Java and Python)
because there are many Ruby users, especially Asian, that still
have to work with legacy encodings for which a lossless Unicode
round-trip is not possible. They'd be forced to use the
binary API.

* Because Ruby already has a rich String API, and because it simplifies
porting of 1.8 code, there is no separate data type for binary strings.

Stefan

Tom Link · Feb 16, 2009

It's the underlying C API that does the line ending conversion.

Ruby inherited that behavior.

Thanks for the clarification (also thanks to Gary).

Ruby does that when you set the internal encoding with
Encoding.default_internal=

Unfortunately this isn't entirely true -- see:
http://groups.google.com/group/ruby-core-google/browse_frm/thread/9103687d4ee9f336?hl=en#

It doesn't convert strings & identifiers in scripts. Of course, there
are good reasons for that (see the responses in the thread) if you
don't define a canonical internal encoding.

There were many and long discussions about the encoding API,
mostly on Ruby core.

I know that people who understand the issues at hand much better than
I do discussed this subject extensively. I still struggle to fully
understand their conclusions though. But this probably is only a
matter of time.

Regards,
Thomas.

Luther · Feb 17, 2009

Not necessarily.

Here is a useful summary of encodings in 1.9:
http://blog.nuclearsquid.com/writings/ruby-1-9-encodings

Basically, you have script encoding, internal encoding, and external
encoding. In you case, you should probably read the files as ASCII8BIT
or binary, I guess.

Thank you. I've put 'r:binary' in the line where I open the file, and
now it seems to work fine. Although if I didn't want to be lazy, I would
probably read it with the default encoding, catch the ArgumentError,
then reread the file.

Thanks again...
Luther

Luther Thompson · Feb 17, 2009

Tom said:
When I recently stumbled over not so different problems (one of which
is described here [1]) it was because the external encoding (see
Encoding.default_external) defaulted to US-ASCII on cygwin because
ruby191RC0 ignored the windows locale and the value of the LANG
variable -- the part with the windows locale was fixed in the
meantime. AFAIK if ruby 191 cannot determine the environment's locale,
it defaults to US-ASCII which causes the described problem if a
character is > 127.

Actually, I always set my LANG to C. Since my original post, I found
that I had forgotten to set my LC_CTYPE to en_US.UTF-8, which is
Ubuntu's default. After fixing that, I still got the same error, but
with "UTF-8" instead of "US-ASCII".

I believe the metadata in that text file must be binary code that was
put there by some word processor, because I remember seeing "Helvetica"
somewhere in there.

Luther

Jason O. · Nov 10, 2010

The magic encoding comment didn't cut it for me. I found the answer in
my case by adding the following to my environment.rb (I run a mixed 1.8
and 1.9 environment):

if RUBY_VERSION =~ /1.9/
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
end

again: invalid byte sequence in US-ASCII	2	May 1, 2011
gsub: invalid byte sequence in US-ASCII	5	Jun 15, 2010
slice! invalid byte sequence in UTF-8	9	Mar 3, 2011
Ruby 1.9.2: How to sanitize text with invalid characters?	6	Oct 11, 2010
Facing exception: Invalid byte 2 of 4-byte UTF-8 sequence.	6	Jan 21, 2010
Invalid byte 2 of 3-byte UTF-8 sequence - inconsistent behavior	6	Nov 15, 2007
ruby unicode/string explosion (0xFF in utf-8)	2	Dec 11, 2010
InputStream - invalid byte 1 of 1-byte UTF-8 sequence	2	Dec 27, 2004

invalid byte sequence in US-ASCII (ArgumentError)

Luther

Tim Hunter

Luther

Tom Link

Brian Candler

Brian Candler

Stefan Lang

Brian Candler

Stefan Lang

Tom Link

Gary Wright

Stefan Lang

Tom Link

Luther

Luther Thompson

Jason O.

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads