Strange Encoding Behavior

Lui Kore · Jan 17, 2010

The encoding of __FILE__ is always the same as Encoding.default_external,
even if there is a magic comment. Sometimes it is necessary to convert
the string into another encoding. Here is some code to demonstrate the
issue:

# coding: utf-8
# Put the script in a path with non-ASCII characters to see the difference.
path = File.expand_path(File.dirname(__FILE__))

puts RUBY_VERSION + ' ' + RUBY_PLATFORM
#=> "1.9.1 i386-mingw32" is my Ruby version
puts path.encoding
#=> "GB2312" on my OS

# Usually "string.encode to, from" works,
# but HERE the new string's bytes seem unchanged:
puts path.encode('utf-8', Encoding.default_external)

# Relabel the string in place, then transcode:
path.force_encoding Encoding.default_external
puts path.encode('utf-8')
# changed at last
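
For readers without a GBK system at hand, the two conversion styles in the script above can be compared on a hand-built string. This is only a minimal sketch, not a reproduction of the 1.9.1 behavior: the sample text, the constants and the helper names are made up for illustration, and on a current Ruby both routes are expected to agree.

```ruby
# Sample text written with \u escapes so the source file stays pure ASCII.
SAMPLE_UTF8 = "\u5176\u4ED6"                     # two CJK characters
# The same text as raw GBK bytes, tagged as binary (String#b returns a copy).
SAMPLE_GBK_BYTES = SAMPLE_UTF8.encode("GBK").b

# Route 1: name the source encoding explicitly, like `path.encode 'utf-8', from`.
def convert_with_explicit_source(bytes, from)
  bytes.encode("UTF-8", from)
end

# Route 2: relabel a copy first (bytes untouched), then transcode.
def convert_after_relabel(bytes, from)
  bytes.dup.force_encoding(from).encode("UTF-8")
end

puts convert_with_explicit_source(SAMPLE_GBK_BYTES, "GBK") == SAMPLE_UTF8
puts convert_after_relabel(SAMPLE_GBK_BYTES, "GBK") == SAMPLE_UTF8
```

If the two lines ever print different values for a real path, that points at how the string was tagged when it was created rather than at the conversion table.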

Luis Lavena · Jan 17, 2010

Ruby 1.9.1, and some parts of 1.9.2, still have certain issues with
paths/folders that contain non-ASCII characters:

http://redmine.ruby-lang.org/issues/show/1685

Lui Kore · Jan 17, 2010

I think #1685 is a little bit different.
Maybe the following code is a bit clearer:

# coding: ascii-8bit
puts Encoding.default_external #=> GBK

def enc(s)
  s.encode 'utf-8', Encoding.default_external
end

p1 = File.expand_path(File.dirname(__FILE__))
p2 = p1.dup
p2.force_encoding p2.encoding # strange, but this makes a difference

puts p1 == p2 #=> true
puts p1.encoding == p2.encoding #=> true
puts enc(p1) == enc(p2) #=> sometimes false ???

Run in console:

D:\其他>ruby t.rb
GBK
true
true
false

Put it in another folder:

D:\other>ruby t.rb
GBK
true
true
true
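
When two strings compare equal yet transcode differently, it can help to look below `==`. The following is only a sketch of such a diagnostic; `describe_string` is a hypothetical helper, not something from this thread, and it relies only on core `String` methods.

```ruby
# Hypothetical helper: summarize the properties that matter for transcoding.
def describe_string(s)
  {
    encoding: s.encoding.name,     # the tag Ruby attached to the string
    valid:    s.valid_encoding?,   # are the bytes well-formed under that tag?
    bytes:    s.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

p1 = File.expand_path(File.dirname(__FILE__))
p2 = p1.dup
p2.force_encoding(p2.encoding)

p describe_string(p1)
p describe_string(p2)
```

Running this from both folders and comparing the two hashes should show whether the strings differ in their tag, their validity, or their raw bytes.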

Robert Klemme · Jan 17, 2010

I believe the point you are missing is that String#encode does not
change the String; it returns a new String with the desired encoding.
If you want in-place modification, you need to use String#encode!, which
does just that.

irb(main):006:0> s="foo"
=> "foo"
irb(main):007:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):008:0> x = s.encode "ASCII"
=> "foo"
irb(main):009:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):010:0> x.encoding
=> #<Encoding:US-ASCII>
irb(main):011:0>

Kind regards

robert
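
The difference between String#encode and String#encode! described in this post can also be checked in a script rather than in irb. A minimal sketch; `demo_encode_vs_bang` is a hypothetical name chosen for illustration.

```ruby
# Hypothetical helper: report the encoding tags seen at each step.
def demo_encode_vs_bang(text, target)
  s = text.encode("UTF-8")               # fresh, unfrozen UTF-8 copy to work on
  copy = s.encode(target)                # non-destructive: returns a new String
  receiver_after_encode = s.encoding.name
  s.encode!(target)                      # destructive: transcodes the receiver itself
  {
    receiver_after_encode: receiver_after_encode,
    copy:                  copy.encoding.name,
    receiver_after_bang:   s.encoding.name
  }
end

p demo_encode_vs_bang("foo", "US-ASCII")
```

Only the last entry should show the receiver itself carrying the target encoding.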

Lui Kore · Jan 18, 2010

I know String#encode doesn't change the original string, but the result
is encoded.

To understand the problem, you should try it in a GBK/Shift_JIS environment
with a Chinese or Japanese path.

The point is:
For some path p1 and p2,
when p1 == p2 and p1.encoding == p2.encoding,
p1.encode('utf-8') == p2.encode('utf-8') is not always true.

To describe it in an "encode!" version:
For some path p1 and p2,
when p1 == p2 and p1.encoding == p2.encoding,
p1.encode!('utf-8')
p2.encode!('utf-8')
p1 == p2 is still not always true
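
The property stated above can be written down as a pair of checks. This is a sketch under assumptions: the method names are hypothetical, and the sample string is built by hand instead of coming from a real GBK path, so on a current Ruby both checks are expected to pass.

```ruby
# Hypothetical check: equal strings with equal tags should transcode equally.
def same_after_encode?(a, b, target = "UTF-8")
  return true unless a == b && a.encoding == b.encoding  # precondition not met
  a.encode(target) == b.encode(target)
end

# The same property in the "encode!" form, run on copies so callers keep their strings.
def same_after_encode_bang?(a, b, target = "UTF-8")
  x = a.dup
  y = b.dup
  return true unless x == y && x.encoding == y.encoding  # precondition not met
  x.encode!(target)
  y.encode!(target)
  x == y
end

a = "\u5176\u4ED6".encode("GBK")
b = a.dup
b.force_encoding(b.encoding)   # the self-relabel from the earlier script

puts same_after_encode?(a, b)
puts same_after_encode_bang?(a, b)
```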

Robert Klemme · Jan 18, 2010

Apparently I misread your posting, sorry. Is UTF-8 capable of
representing those Japanese or Chinese characters? I believe I
remember Matz saying that UTF-8 is insufficient to properly represent
Japanese characters. If this is the case then I guess all bets are
off and you get undefined behavior. Although it might be desirable to
get the same garbage it may not be worthwhile to ensure this purely
for efficiency reasons.

Kind regards

robert
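
On the worry about undefined behavior: as far as String#encode in current Ruby goes, a character with no counterpart in the target encoding raises Encoding::UndefinedConversionError by default, and bytes that are malformed in the source encoding raise Encoding::InvalidByteSequenceError. A small sketch with made-up sample data and a hypothetical helper name, which also shows GBK text surviving a round trip through UTF-8:

```ruby
# Hypothetical helper: return the converted string, or the error class if conversion fails.
def try_encode(s, to)
  s.encode(to)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError => e
  e.class
end

# Round trip GBK -> UTF-8 -> GBK for two CJK characters.
def gbk_round_trip_ok?(utf8_text)
  gbk = utf8_text.encode("GBK")
  gbk.encode("UTF-8").encode("GBK") == gbk
end

puts gbk_round_trip_ok?("\u5176\u4ED6")

# A CJK character has no US-ASCII counterpart.
p try_encode("\u5176", "US-ASCII")

# 0xFF is not a well-formed GBK byte sequence.
p try_encode("\xFF".b.force_encoding("GBK"), "UTF-8")
```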
