String encoding issues

Oliver Peng

I ran into several issues with string encodings. Here is what I see:

[root@mars mysql]# irb -E ascii
# I start irb with the default external encoding set to ASCII.

irb(main):014:0> String.new.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):015:0> "".encoding
=> #<Encoding:US-ASCII>
# I get different encodings depending on how I create an empty string. Why?

irb(main):023:0> "\x80".encoding
=> #<Encoding:ASCII-8BIT>
irb(main):024:0> "\x7F".encoding
=> #<Encoding:US-ASCII>

# It looks like a literal containing a byte greater than 0x7F gets the
ASCII-8BIT encoding. That is OK.

irb(main):005:0> new_str = "\xF1\xF2"
=> "\xF1\xF2"
irb(main):006:0> new_str.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):007:0> s ="%c%c%c%c%c%s" % [49, 5, 245, 225, 1, new_str]
Encoding::CompatibilityError: incompatible character encodings: US-ASCII
and ASCII-8BIT
from (irb):7:in `%'
from (irb):7
from /bin/irb:12:in `<main>'

# Now when I pass an ASCII-8BIT string as a format argument, it raises an
exception. Why?

irb(main):008:0> s ="%c%c%c%c%c%s" % [49, 5, 45, 25, 1, new_str]
=> "1\x05-\x19\x01\xF1\xF2"

# I am very surprised that the same format works if none of the %c values
is greater than 0x7F.
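One way to probe the difference between the last two calls is Encoding.compatible?, which returns the encoding two strings could be combined in, or nil if they cannot be combined. The sketch below is an illustration and not part of the original session: us_ascii_from_bytes is a hypothetical helper, and the US-ASCII strings are built with pack and force_encoding to stand in for the text formatted so far, since %c may behave differently in other Ruby versions.

```ruby
# Hypothetical helper: build a US-ASCII-tagged string from raw byte values.
# This stands in for the text that the %c directives have produced so far.
def us_ascii_from_bytes(bytes)
  bytes.pack("C*").force_encoding(Encoding::US_ASCII)
end

binary = "\xF1\xF2".b  # ASCII-8BIT string holding bytes above 0x7F

seven_bit  = us_ascii_from_bytes([49, 5, 45, 25, 1])    # only bytes <= 0x7F
high_bytes = us_ascii_from_bytes([49, 5, 245, 225, 1])  # includes 0xF5, 0xE1

# An Encoding object means the two strings can be combined.
p Encoding.compatible?(seven_bit, binary)

# nil means combining them would raise Encoding::CompatibilityError.
p Encoding.compatible?(high_bytes, binary)
```

If that reading is right, the format only fails when the US-ASCII text produced so far already holds bytes above 0x7F, which would match the two results above.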

irb(main):012:0> s ="%c%c%c%c%c" % [49, 5, 245, 225, 1]
=> "1\x05\xF5\xE1\x01"
irb(main):013:0> s.encoding
=> #<Encoding:US-ASCII>

# If I leave the ASCII-8BIT string out of the format, it also works. But I
am very surprised that the encoding is US-ASCII even though the string
contains non-ASCII bytes. Why?
 
Oliver Peng

I figured out the first question.

[root@mars mysql]# irb
irb(main):001:0> s = String.new
=> ""
irb(main):002:0> s.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):003:0> puts Encoding.default_external.name
UTF-8

Ruby will always use ASCII-8BIT as the encoding when you use String.new to
create a new String object.
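A minimal sketch of the same observation, plus two ways to get a different encoding for a new empty string. It assumes a Ruby newer than the 1.9 used above, because the encoding: keyword of String.new was added later; new_string_encodings is a hypothetical name used only for this illustration.

```ruby
# Hypothetical helper: collect the encodings of empty strings created in
# three different ways.
def new_string_encodings
  {
    # No arguments: the encoding is ASCII-8BIT regardless of the locale.
    default: String.new.encoding,
    # Assumes a Ruby that supports the encoding: keyword.
    keyword: String.new(encoding: Encoding::UTF_8).encoding,
    # Works on 1.9 as well: retag the empty string after creating it.
    forced: String.new.force_encoding(Encoding::UTF_8).encoding
  }
end

p new_string_encodings
```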
 
Brian Candler

Oliver said:
Ruby will always use ASCII-8BIT as the encoding when you use String.new to
create a new String object.

Ugh. That's another special case to add to
http://github.com/candlerb/string19/blob/master/string19.rb

However, in practice it doesn't matter much, because an empty string is
compatible with a string in any other encoding.

irb(main):001:0> s1 = String.new
=> ""
irb(main):002:0> s2 = "groß"
=> "groß"
irb(main):003:0> s1.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> s2.encoding
=> #<Encoding:UTF-8>
irb(main):005:0> s1 + s2
=> "groß"
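The contrast with a non-empty binary string can be sketched as follows. try_concat is a hypothetical helper for this illustration, and the \u escape is used so that the literal is UTF-8 whatever the source encoding is.

```ruby
# Hypothetical helper: return the encoding of a + b, or nil if Ruby
# refuses to combine the two strings.
def try_concat(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError
  nil
end

utf8 = "gro\u00DF"  # "groß" as a UTF-8 string

# Empty ASCII-8BIT string: the result takes the other string's encoding.
p try_concat(String.new, utf8)

# Non-empty ASCII-8BIT string with a byte above 0x7F: incompatible, so nil.
p try_concat("\xF1".b, utf8)
```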

As for the other thing you found:

irb(main):003:0> s = "%c%c%c%c%c".force_encoding("US-ASCII")
=> "%c%c%c%c%c"
irb(main):004:0> t = s % [49, 5, 245, 225, 1]
=> "1\x05\xF5\xE1\x01"
irb(main):005:0> t.encoding
=> #<Encoding:US-ASCII>

I think it's just one of the many bugs in Ruby 1.9.x, likely due to a
total lack of specification of the new behaviour for all methods which
accept or return strings (although if there's no specification, I
suppose you can't really argue it's a bug; it can behave however it
likes).
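Whatever its status, a string in that state can be detected with valid_encoding?, and for raw byte data the problem can be sidestepped by building everything as ASCII-8BIT from the start. The following is only a sketch under those assumptions: broken? and binary_packet are hypothetical names, and using Array#pack in place of the %c format is a substitution, not something proposed earlier in the thread.

```ruby
# Hypothetical helper: true if the string holds bytes that are not valid
# in the encoding it is tagged with.
def broken?(str)
  !str.valid_encoding?
end

# Hypothetical helper: build a binary string from byte values plus a
# payload, without going through a format string.
def binary_packet(bytes, payload)
  bytes.pack("C*") + payload.b  # both sides are ASCII-8BIT
end

# Same bytes as t above, tagged US-ASCII: 0xF5 and 0xE1 are not valid there.
t = [49, 5, 245, 225, 1].pack("C*").force_encoding(Encoding::US_ASCII)
p broken?(t)

packet = binary_packet([49, 5, 245, 225, 1], "\xF1\xF2")
p packet.encoding
p packet.bytes
```

Since every byte value is valid in ASCII-8BIT, the packed string never ends up in the broken state, and appending another ASCII-8BIT string does not raise.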
 
