ruby 1.9.1: Encoding trouble: broken US-ASCII String


Tom Link

Hi,

Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I have the following files:

testEncoding.rb:
#!/usr/bin/env ruby
# encoding: ISO-8859-1

p __ENCODING__

text = File.read("text.txt")
text.each_line do |line|
  p line =~ /foo/
end


text.txt:
Foo äöü bar.

I use: ruby 1.9.1 (2008-12-01 revision 20438) [i386-cygwin]

If I run: ruby19 testEncoding.rb, I get:
#<Encoding:ISO-8859-1>
testEncoding.rb:8:in `block in <main>': broken US-ASCII string
(ArgumentError)

Ruby detects the magic encoding comment but nevertheless assumes the
text file is 7-bit ASCII. The source file's encoding is only respected
if I add the command-line option -E ISO-8859-1. I could also set the
encoding explicitly for each string, but ...

I found some hints that the default charset for external sources is
deduced from the locale. So I set LANG to de_AT, de_AT.ISO-8859-1 and
some more variants, to no avail.

How exactly is this supposed to work? What other options do I have to
make ASCII-8BIT or Latin-1 the default encoding without having to
supply an extra command-line option and without having to rely on an
environment variable? Why isn't ASCII-8BIT the default in the first
place? Why isn't __ENCODING__ a global variable I can assign a value
to?

Thanks,
Thomas.
 

Brian Candler

Tom said:
Right now, I'm not exactly thrilled by the way ruby 1.9 handles
encodings. Could somebody please explain things or point me to some
reference material:

I asked the same over at ruby-core recently. There were some useful
replies:

http://www.ruby-forum.com/topic/173179#759661

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)
text = File.read("text.txt")

This should work:

text = File.read("text.txt", :encoding=>"ISO-8859-1")

I still don't know how the default is worked out though.

Regards,

Brian.
 

Tom Link

text = File.read("text.txt", :encoding=>"ISO-8859-1")

Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.
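
One possible workaround (just a sketch, feature-testing at runtime
rather than parsing version strings) might be:

text = if "".respond_to?(:encoding)
         # 1.9: name the file's actual encoding explicitly
         File.read("text.txt", :encoding => "ISO-8859-1")
       else
         # 1.8: strings are plain byte sequences anyway
         File.read("text.txt")
       end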

Many thanks for the pointer to the other thread over at ruby core.

Regards,
Thomas.
 

Brian Candler

Tom said:
Unfortunately, this isn't compatible with ruby 1.8. A script that uses
such a construct runs only with ruby 1.9. Sigh.

If all else fails, read the source.

I see that the encoding falls back to rb_default_external_encoding(),
which returns default_external, setting it if necessary from
rb_enc_from_index(default_external_index)

This in turn is set from rb_enc_set_default_external

This in turn is set from cmdline_options.ext.enc.name

And this in turn is set from the -E flag (or certain legacy settings on
-K). So:

$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

Yay. However, if it is possible to set the default external encoding
programmatically (i.e. not via the command-line options) I couldn't
see how.
 

Brian Candler

Brian said:
$ ruby19 -E ISO-8859-1 -e 'puts File.open("/etc/passwd").gets.encoding'
ISO-8859-1

D'oh. I see from the original post that you knew this already.

It seems that Ruby keeps state for:
- default external encoding (e.g. for files being read in)
- default internal encoding (not sure what this is; you can set it
using -E too, but it defaults to nil)

and these are independent of the encodings of source files, which use
the magic comments to declare their encoding.

You can read these using Encoding.default_external and
Encoding.default_internal, but there don't appear to be setters for
them.
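
(More recent 1.9 builds do seem to grow Encoding.default_external= as
a setter; a sketch, assuming your build has it:

begin
  Encoding.default_external = "ISO-8859-1"
rescue NoMethodError
  # older builds: fall back to the -E command-line flag
end
)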
 

Brian Candler

Ah, there is a preview here:

http://books.google.co.uk/books?id=...X&oi=book_result&resnum=4&ct=result#PPA358,M1

Something like this may do the trick:

text = File.open("..") do |f|
  f.set_encoding("ISO-8859-1") rescue nil
  f.read
end

But then you may as well just do:

text.force_encoding("ISO-8859-1") rescue nil
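
(Note, in case it bites anyone: force_encoding merely relabels the
bytes without touching them, whereas String#encode actually
transcodes. A minimal illustration:

s = "Foo \xE4\xF6\xFC bar"        # raw Latin-1 bytes
s.force_encoding("ISO-8859-1")    # relabel only; bytes unchanged
utf = s.encode("UTF-8")           # transcode; bytes rewritten
)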

I'm not sure in which way the regexp is incompatible with the data read.
I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

I can't really replicate without a hexdump of your text.txt. But it
would be interesting to see the result of:

text.each_line do |line|
  p line.encoding
  p /foo/.encoding
  p line =~ /foo/
end

Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :-(
 

James Gray

But the upshot is that this is all pretty much undocumented so far.
(Well it might be documented in the 3rd ed Pickaxe, but I'm not buying
that yet)

The Pickaxe does cover a lot of the new encoding behavior.

James Edward Gray II
 

James Gray

- default internal encoding (not sure what this is; you can set it
using -E too, but it defaults to nil)

Default internal is the encoding IO objects will transcode incoming
data into, by default. So you could set this to UTF-8 and then read
from various different encodings (specifying each type in the open()
call), but only work with Unicode in your script.
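
A quick sketch (latin1.txt is a made-up name for a Latin-1 file; the
second field in the mode string is the internal encoding):

File.open("latin1.txt", "r:ISO-8859-1:UTF-8") do |f|
  p f.gets.encoding   #=> #<Encoding:UTF-8> -- transcoded on the way in
end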

James Edward Gray II
 

James Gray

I would have thought that a US-ASCII regexp should be able to match
ISO-8859-1 data, and perhaps vice versa, but it seems not.

It does:

$ ruby_dev -e 'p "r=E9sum=E9".encode("ISO-8859-1") =3D~ /foo/'
nil
$ ruby_dev -e 'p "r=E9sum=E9 foo".encode("ISO-8859-1") =3D~ /foo/'
7
Maybe what's really needed is a sort of "anti-/u" option which means "my
regexp literals are meant to match byte-at-a-time, not
character-at-a-time"

That's what BINARY means.
Anyway, I'm afraid all this increases my inclination to stick with ruby
1.8.6 :-(

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

Also, even in its current draft form, the Pickaxe answers every
question you've thrown at both mailing lists. Thus it should be a big
help when you decide the time is right to pick it up.

James Edward Gray II
 

Brian Candler

James said:
It does:

$ ruby_dev -e 'p "r�sum�".encode("ISO-8859-1") =~ /foo/'
nil
$ ruby_dev -e 'p "r�sum� foo".encode("ISO-8859-1") =~ /foo/'
7

I found that too, but was confused by the "broken US-ASCII string"
exception which the OP saw.

I suppose the external_encoding is defaulting to US-ASCII on that
system.

This means his program will break on every file passed into it which has
a character with the top bit set. You can argue that's "failsafe", in
the sense of bombing out rather than continuing processing with the
wrong encoding, and it therefore forces you to change your program or
the command-line args to specify the actual encoding in use.

However, that's pretty unforgiving. I can use Unix grep on a file with
unknown character set or broken UTF-8 characters and it works quite
happily.

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

irb(main):011:0> s = "foo\xff\xff\xffbar".force_encoding("BINARY")
=> "foo\xFF\xFF\xFFbar"
irb(main):012:0> s =~ /foo/
=> 0
That's what BINARY means.

On the String side, yes.

I was thinking of an option on the Regexp: /foo/b or somesuch.
(In contrast to /foo/u in 1.8 meaning 'this Regexp matches unicode')

Or can you set BINARY encoding on the Regexp too? I couldn't see
how.
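
(Experimenting a little, the old /n flag does seem to survive in 1.9
and yield a binary regexp, as does building one from a binary pattern
string -- a sketch, not verified on every build:

p(/\xFF/n.encoding)                                       # ASCII-8BIT
p(Regexp.new("\xFF".force_encoding("BINARY")).encoding)   # ASCII-8BIT
)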
 

Tom Link

There's a lot going on here, so it's a lot to take in. In places, the
behavior is a little complex. However, the core team has put a lot of
effort into making the system easier to use. It's getting there.

It would have been nice, though, if the defaults had been chosen so
that they don't break 1.8 scripts -- or if some 8-bit-clean encoding
were used when the data contains 8-bit characters, instead of throwing
an error.
 

James Gray

It would have been nice, though, if the defaults had been chosen so
that they don't break 1.8 scripts -- or if some 8-bit-clean encoding
were used when the data contains 8-bit characters, instead of throwing
an error.

I think it's probably more important to get this encoding interface
right than to worry about 1.8 compatibility. We knew 1.9 was going to
break some things, so the time was right.

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/19552

That's a pretty common "best practice" in the Ruby community, from
what I've seen. Even Rails pushes this approach now.

If you have gone this way though, you may want to migrate to the even
better -U switch in 1.9.
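
For example (a quick check; -U sets the default internal encoding to
UTF-8):

$ ruby19 -U -e 'p Encoding.default_internal'
#<Encoding:UTF-8>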

James Edward Gray II
 

James Gray

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?

The default encoding is pulled from your environment: LANG or
LC_CTYPE, I believe. This is very important and it makes simple
scripting fit in well with the environment.
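
A quick way to check (assuming the locale is actually installed on
your system):

$ LANG=de_AT.ISO-8859-1 ruby19 -e 'p Encoding.default_external'
#<Encoding:ISO-8859-1>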

James Edward Gray II
 

Brian Candler

Wouldn't it be kinder to default to BINARY if the encoding is
unspecified?
The default encoding is pulled from your environment: LANG or
LC_CTYPE, I believe. This is very important and it makes simple
scripting fit in well with the environment.

The code seems to say:
- if an encoding is chosen in the environment but is unknown to Ruby,
use ASCII-8BIT (aka BINARY)
- if Ruby was built on a system where it doesn't know how to ask the
environment for a language, then use US-ASCII

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

There must have been a good reason why US-ASCII was chosen, rather than
ASCII-8BIT, for systems without langinfo.h.

Regards,

Brian.

rb_encoding *
rb_locale_encoding(void)
{
    VALUE charmap = rb_locale_charmap(rb_cEncoding);
    int idx;

    if (NIL_P(charmap))
        idx = rb_usascii_encindex();
    else if ((idx = rb_enc_find_index(StringValueCStr(charmap))) < 0)
        idx = rb_ascii8bit_encindex();

    if (rb_enc_registered("locale") < 0) enc_alias("locale", idx);

    return rb_enc_from_index(idx);
}

...

VALUE
rb_locale_charmap(VALUE klass)
{
#if defined NO_LOCALE_CHARMAP
    return rb_usascii_str_new2("ASCII-8BIT");
#elif defined HAVE_LANGINFO_H
    char *codeset;
    codeset = nl_langinfo(CODESET);
    return rb_usascii_str_new2(codeset);
#elif defined _WIN32
    return rb_sprintf("CP%d", GetACP());
#else
    return Qnil;
#endif
}
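
Incidentally, rb_locale_charmap is exposed to Ruby as
Encoding.locale_charmap, so it is easy to check what a given build
detects (the output shown is what I'd expect on the OP's Cygwin box,
not something I've verified there):

$ ruby19 -e 'puts Encoding.locale_charmap'
US-ASCII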
 

Ollivier Robert

Perhaps it's a bit early to make this judgement since you've just
started learning about the new system?

From what I've seen, having experimented with 1.9 for a few months, my
main gripe is that the whole encoding support is overly complex. I know
m17n is not solved by the magic Unicode wand, but I'd love to have a
simpler way.
 

Brian Candler

Yukihiro said:
The whole picture must be complex, since encoding support itself is
VERY complex indeed. History sucks. But for daily use, just remember
to specify the encoding if you are not sure what the default_encoding
is, e.g.

f = open(path, "r:iso-8859-1")

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

However, I also don't like the unstated assumption that all Strings
contain text.

In RFC2045 (MIME), there is a distinction made between 7bit text, 8bit
text, and binary data.

But if you label a string as "binary", Ruby changes this to
"ASCII-8BIT". I think that is a misrepresentation of that data, if it is
not actually ASCII-based text. I would much rather it made no assertion
about the content than a wrong assertion.
 

Dave Thomas

It seems to go against DRY to have to write "r:binary" or "rb:binary"
when opening lots of binary files. But if I remember to use
#!/usr/bin/ruby -Knw everywhere that should be OK.

You used to have to do that. In recent HEADs, "rb" sets binary encoding
automatically (unless overridden).


Dave
 

Tom Link

Also, if you've been using the -KU switch in Ruby 1.8 and working with
UTF-8 data, 1.9 may work pretty well for you

Well, I'm still stuck with Latin-1. It's interesting, though, that
according to B. Candler the fallback for unknown encodings should be
8-bit clean and that US-ASCII should only be used as a last resort.
Maybe it's just a Cygwin thing?

Could we/I please get more information on how exactly the charset is
chosen, depending on which environment variable, and whether this
applies to Cygwin too? It appears to me that neither LANG nor LC_CTYPE
has any effect on charset selection. But maybe I'm doing it wrong.

Regards,
Thomas.
 

Brian Candler

Yukihiro said:
open(path, "rb") is your friend. It sets encoding to binary.

Thanks.

"rb" is now performing two jobs then - prevent line-ending translation
(on those platforms which do it), and set encoding to binary. Something
to remember.
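
For instance (a sketch; some.bin is a made-up name, and the behavior
is per Dave's note about recent builds):

$ ruby19 -e 'p File.open("some.bin", "rb") { |f| f.read.encoding }'
#<Encoding:ASCII-8BIT>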
 

Tom Link

So I would read from this that the OP has either fallen foul of the
US-ASCII fallback (e.g. no langinfo.h when building under Cygwin), or
else his environment has explicitly picked US-ASCII.

Somebody mentions on http://bugs.python.org/issue3824 that:
"And nl_langinfo(CODESET) is useless on cygwin because it's always US-
ASCII."

And here: http://svn.xiph.org/trunk/vorbis-tools/intl/localcharset.c
"Cygwin 2006 does not have locales. nl_langinfo (CODESET) always
returns "US-ASCII"."

If I understood you right, this could cause the problems I
encountered.
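
In the meantime, one workaround might be to put the -E flag into
RUBYOPT (a sketch; I believe 1.9 honours -E there, though I haven't
verified it on Cygwin):

$ export RUBYOPT=-EISO-8859-1
$ ruby19 testEncoding.rb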

Cygwin 1.7 is currently in beta. Maybe that will improve things in
this respect?

Regards,
Thomas.
 
