[ruby 1.9] reading an UTF-8 encoded file

Une Bévue · Mar 10, 2010

If I read a UTF-8 encoded file and print it to the terminal, I do not
get the same result with ruby 1.8.x and ruby 1.9.

With 1.8 I get "é" correctly; with 1.9 I get the wrong "Ã©", even if I
specify the encoding with:
open(__FILE__, "r:UTF-8") do ...

What did I misunderstand?

Brian Candler · Mar 10, 2010

Une said:
if i read and output to terminal an UTF-8 encoded file, i do not have
the same result with ruby 1.8.x and ruby 1.9

with 1.8 i get "ï¿½" correctly, with 1.9 i get it wrong "Ã©" even if i
specify the encoding by :
open(__FILE__, "r:UTF-8") do ...

In my web browser onto ruby-forum, I see what you say is the "correct"
symbol as invalid above, and the "wrong" symbol is a valid one.

Are you in irb, or running code in a .rb file? Are you using "puts" or
are you looking at the string values as returned by irb, after the =>
prompt?

In either case, show your actual code. Beware that things behave
strangely in irb with 1.9. Some of the oddities I noticed in irb are
documented in
http://github.com/candlerb/string19/blob/master/string19.rb
from about line 1648.

What did I misunderstand?

Remember that encodings by themselves don't actually change the sequence
of bytes. If your code is something like this:

open("somefile.txt") do |f|
  while line = f.gets
    puts line
  end
end

and you run it as a .rb script, I would expect it to work the same in
both 1.8 and 1.9. That is, it should read lines and squirt them back out
to stdout unchanged. No transcoding is done. If they appear wrongly, it
would be because the encoding of the file contents is not the same as
the encoding of your terminal.

Furthermore, it makes no difference in 1.9 if you do this:

open("somefile.txt","r:UTF-8") do |f|
  while line = f.gets
    puts line
  end
end

In ruby 1.9, all this means is that the string 'line' will be tagged as
being UTF-8, rather than some encoding picked up from the environment.
However by default, the same sequence of bytes will be squirted out.
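
A minimal sketch of that point, assuming MRI 1.9 or later rather than MacRuby 0.5. The temp file and variable names are made up for illustration. It reads the same bytes twice with two different external encodings and shows that only the tag changes:

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical sample file holding one UTF-8 line ("é" is the two bytes C3 A9).
sample = Tempfile.new("tagging")
sample.binmode
sample.write("vérité\n")
sample.close

# Read the same file twice, tagging the result with two different encodings.
as_utf8   = File.open(sample.path, "r:UTF-8")      { |f| f.gets }
as_latin1 = File.open(sample.path, "r:ISO-8859-1") { |f| f.gets }

p as_utf8.encoding                          # the tags differ ...
p as_latin1.encoding
p as_utf8.bytes.to_a == as_latin1.bytes.to_a  # ... but the bytes are identical
p as_utf8.length                            # 7 characters when tagged UTF-8
p as_latin1.length                          # 9, one character per byte
```

Either string, sent to a UTF-8 terminal with puts, should display the same way, since the bytes written are the same.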

However in 1.9 you *can* cause the string to be transcoded, if:

(1) you specify a different internal and external encoding when reading
the data (so it gets transcoded on input); or

(2) you specify an external encoding when writing the data (so it gets
transcoded on output)
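
The two cases above can be sketched as follows, again assuming MRI 1.9 or later. The file names and the ISO-8859-1 sample data are assumptions for illustration, not taken from the original script:

```ruby
# encoding: utf-8
require "tempfile"

# Hypothetical input whose bytes are ISO-8859-1: "é" is the single byte E9.
latin1_file = Tempfile.new("latin1")
latin1_file.binmode
latin1_file.write([0xE9, 0x0A].pack("C*"))
latin1_file.close

# (1) External ISO-8859-1, internal UTF-8: transcoded while reading.
line = File.open(latin1_file.path, "r:ISO-8859-1:UTF-8") { |f| f.gets }
p line.encoding       # UTF-8
p line.bytes.to_a     # [195, 169, 10], no longer the byte stored on disk

# (2) External ISO-8859-1 on a file opened for writing: transcoded on output.
out = Tempfile.new("out")
out.close
File.open(out.path, "w:ISO-8859-1") { |f| f.write("é\n") }
written = File.open(out.path, "rb") { |f| f.read }
p written.bytes.to_a  # [233, 10] on Unix-like systems
```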

HTH,

Brian

Une Bévue · Mar 11, 2010

Brian Candler said:
Are you in irb, or running code in a .rb file? Are you using "puts" or
are you looking at the string values as returned by irb, after the =>
prompt?

In either case, show your actual code. Beware that things behave
strangely in irb with 1.9.

first, thanks for your reply )))

I'm not using irb but an .rb file launched from Terminal.

here is the code (ruby 1.9) :

------------------------------------------------------------------------
#! /usr/local/bin/macruby
# encoding: utf-8

SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

open(SIGNATURES_FILE, "r:UTF-8") do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
open(SIGNATURES_FILE) do |file|
  p file.internal_encoding
  file.each do |line|
    p [line.encoding.name, line]
  end
end
------------------------------------------------------------------------

resulting in :
zsh-% ./essai_macruby.rb
nil
["US-ASCII", "-- \n"]
["US-ASCII", "Â« Un banquier est toujours en libertÃ© provisoire Â» \n"]
["US-ASCII", "(Henri PoincarÃ© )\n"]

....

["US-ASCII", "la minute de vÃ©ritÃ© risque de se faire longtemps
attendre. Â» \n"]
["US-ASCII", "(Pierre Dac)\n"]
nil
["US-ASCII", "-- \n"]
["US-ASCII", "Â« Un banquier est toujours en libertÃ© provisoire Â» \n"]
["US-ASCII", "(Henri PoincarÃ© )\n"]
[
....

["US-ASCII", "la minute de vÃ©ritÃ© risque de se faire longtemps
attendre. Â» \n"]
["US-ASCII", "(Pierre Dac)\n"]
zsh-%

So both methods (with and without "r:UTF-8") tag the lines as US-ASCII,
although the file is really UTF-8 encoded.

now the "equivalent" test using ruby 1.8.* :

------------------------------------------------------------------------
#! /usr/bin/env ruby

SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

open(SIGNATURES_FILE) do |file|
  file.each do |line|
    puts line
  end
end
------------------------------------------------------------------------

run from Term :

zsh-% ./essai.rb
--
« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )

....

« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)

The accented chars are correct now. Note that I have to use "puts"
instead of "p" to get the chars; otherwise I get the escaped bytes, as in
"v\303\251rit\303\251".

Brian Candler · Mar 11, 2010

Use 'puts' instead of 'p' and it may work. That is, I suspect
String#inspect is doing some mangling.

You really should look at your postings in ruby-forum:
http://www.ruby-forum.com/topic/205792

Wherever you say ruby 1.9 is giving the 'wrong' output it is correct,
and where you say ruby 1.8 is giving the 'right' output it is wrong. I
have a suspicion that there is a mismatch between the file content and
the terminal.

What if you just type "cat /Users/yt/dev/Signature/signatures.txt" at
the terminal?

accentuated chars are correct now, notice i have to use "puts" instead
of "p" to get the chars otherwise i got the unicode code as
"v\303\251rit\303\251".

Yes, by default String#inspect in ruby 1.8 will mangle every byte value
of 128 and above into escaped octal form (hence \303\251 for the two
bytes of "é"). String#inspect in ruby 1.9 behaves differently, and
doesn't always mangle them.
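
A small sketch of the difference, assuming MRI 1.9 or later. The variable names are illustrative, and the octal line only reproduces the 1.8-style escapes by hand:

```ruby
# encoding: utf-8
word = "vérité"

# The same bytes tagged as binary, so inspect cannot treat them as characters.
raw = word.dup.force_encoding("BINARY")

puts word.inspect  # how this is escaped depends on the tag and on the output encoding
puts raw.inspect   # "v\xC3\xA9rit\xC3\xA9"

# The octal escapes that ruby 1.8 printed are the same two bytes written in base 8.
octal = "é".bytes.to_a.map { |b| "\\%o" % b }.join
puts octal         # \303\251
```

So p shows a representation of the string, while puts writes the bytes themselves, which is why puts is the one to use when checking what the terminal displays.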

However, I just noticed 'macruby' in your scripts. Are you actually
running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it's macruby
all bets are off - I thought it was a completely different interpreter
written from scratch. I have no Mac here to compare behaviour with, and
I have no idea what variation of 1.9 encoding rules MacRuby has
implemented.

In particular, I'm surprised that your program sees strings tagged as
"US-ASCII" rather than "UTF-8" when you explicitly opened the file with
external encoding of UTF-8. This makes me very suspicious of your actual
ruby platform.

Try adding this line to your code to get info about the Ruby platform:
p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }

Regards,

Brian.

P.S. For comparison, here's what I get with an oldish ruby pre-1.9.2
under Linux. Try these on your system.
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

Une Bévue · Mar 11, 2010

Brian Candler said:
Use 'puts' instead of 'p' and it may work. That is, I suspect
String#inspect is doing some mangling.

You really should look at your postings in ruby-forum:
http://www.ruby-forum.com/topic/205792

Wherever you say ruby 1.9 is giving the 'wrong' output it is correct,
and where you say ruby 1.8 is giving the 'right' output it is wrong. I
have a suspicion that there is a mismatch between the file content and
the terminal.

What if you just type "cat /Users/yt/dev/Signature/signatures.txt" at
the terminal?

i got the correct chars :
zsh-% cat /Users/yt/dev/Signature/signatures.txt
--
« Un banquier est toujours en liberté provisoire »
(Henri Poincaré )

....
--
« Pour ceux qui vont chercher midi à quatorze heures,
la minute de vérité risque de se faire longtemps attendre. »
(Pierre Dac)

Yes, String#inspect in ruby 1.8 will mangle all values over 128 into
escaped form. String#inspect in ruby 1.9 behaves differently, and
doesn't always mangle them.

However, I just noticed 'macruby' in your scripts. Are you actually
running MacRuby, or genuine Matz Ruby Interpreter 1.9 ? If it's macruby
all bets are off - I thought it was a completely different interpreter
written from scratch. I have no Mac here to compare behaviour with, and
I have no idea what variation of 1.9 encoding rules MacRuby has
implemented.

In particular, I'm surprised that your program sees strings tagged as
"US-ASCII" rather than "UTF-8" when you explicitly opened the file with
external encoding of UTF-8. This makes me very suspicious of your actual
ruby platform.

Now, that is to say using puts in place of p, I get the right chars.
But those strings are still tagged as "US-ASCII"...

Try adding this line to your code to get info about the Ruby platform:
p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }

it seems to be an "old" Ruby 1.9.0 :

[[:RUBY_VERSION, "1.9.0"], [:RUBY_RELEASE_DATE, "2008-06-03"],
[:RUBY_PLATFORM, "universal-darwin10.0"], [:RUBY_PATCHLEVEL, 0],
[:RUBY_REVISION, 0], [:RUBY_DESCRIPTION, "MacRuby version 0.5 (ruby
1.9.0) [universal-darwin10.0, x86_64]"], [:RUBY_COPYRIGHT, "MacRuby -
Copyright (C) 2007-2008 Apple Inc."], [:RUBY_ENGINE, "macruby"],
[:RUBY_ARCH, "x86_64"], [:MACRUBY_VERSION, "0.5"], [:MACRUBY_REVISION,
"svn revision 3380 from
http://svn.macosforge.org/repository/ruby/MacRuby/branches/0.5"]]

thanks again !

Brian Candler · Mar 11, 2010

Ah, so it's not ruby 1.9, it's macruby 0.5. I'm afraid I'll have to
defer to the Mac experts here.

I see there that macruby has its own mailing lists:
http://www.macruby.org/contact-us.html

Une Bévue · Mar 11, 2010

Une Bévue said:
right now, that's to say using puts in place of p, i get the right
chars.
But those strings are still taged by "US-ASCII"...

However, when doing some splitting with:

def get_signatures
  t = "".force_encoding("UTF-8")
  open(SIGNATURES_FILE, "r:UTF-8") do |file|
  #open(SIGNATURES_FILE) do |file|
    file.each do |line|
      t += line.force_encoding("UTF-8")
    end
  end
  #File.open(SIGNATURES_FILE, "r:UTF-8").each {|l| t += l }
  return t.split(NEEDLE)
end

(notice i've forced the encoding)

signatures = get_signatures
c = signatures.count
puts "Nombre de signatures : #{c}"

r = rand(c)
puts "Signature aléatoire (n° #{r}) :"
signature = NEEDLE + signatures[r]
puts signature

the output is wrong in that case ???

Nombre de signatures : 29
Signature aléatoire (n° 1) :

Une Bévue · Mar 11, 2010

Brian Candler said:
Ah, so it's not ruby 1.9, it's macruby 0.5. I'm afraid I'll have to
defer to the Mac experts here.

OK, I'll ask there, even though it is based upon ruby 1.9:
[:RUBY_VERSION, "1.9.0"], [:RUBY_RELEASE_DATE, "2008-06-03"]

Brian Candler · Mar 11, 2010

If forcing the encoding of individual strings makes Ruby output them
differently, then I expect that STDOUT must have an external encoding
set.

Try this:

p STDOUT.external_encoding

What do you get? If you get something other than nil, it means that puts
will transcode characters from the tagged encoding to this encoding.

In ruby 1.9, STDOUT.external_encoding is nil unless you set it
explicitly.
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.
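
As a rough illustration of what external_encoding reports on a stream opened for writing, here is a sketch that uses a temp file in place of STDOUT. It assumes MRI 1.9 or later, and the names are hypothetical:

```ruby
require "tempfile"

target = Tempfile.new("stdout-like")
target.close

# No encoding named when opening for writing: nothing to transcode to.
plain  = File.open(target.path, "w")            { |f| f.external_encoding }
# Encoding named: writes are transcoded from each string's tag to this encoding.
tagged = File.open(target.path, "w:ISO-8859-1") { |f| f.external_encoding }

p plain   # nil
p tagged  # #<Encoding:ISO-8859-1>

# The same query on the real stream, for comparison on your own system.
p STDOUT.external_encoding
```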

Une Bévue · Mar 11, 2010

Brian Candler said:
If forcing the encoding of individual strings makes Ruby output them
differently, then I expect that STDOUT must have an external encoding
set.

Try this:

p STDOUT.external_encoding

What do you get? If you get something other than nil, it means that puts
will transcode characters from the tagged encoding to this encoding.

I got nil ))

In ruby 1.9, STDOUT.external_encoding is nil unless you set it
explicitly.
=> "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

Yes, right, I'll ask the MacRuby list.

In fact, I also have a built-in ruby 1.8.x, but I'd rather use 1.9
because I have to count UTF-8 chars, and I know this is built into
ruby 1.9. And because I might design a UI, it's better to use MacRuby,
since it is written on top of Obj-C and Cocoa.
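
For reference, a minimal sketch of the character counting mentioned above, assuming MRI 1.9 or later and a string correctly tagged as UTF-8:

```ruby
# encoding: utf-8
word = "vérité"

p word.length    # 6 characters
p word.bytesize  # 8 bytes, since each "é" takes two

# With the wrong tag, the count falls back to one character per byte.
p word.dup.force_encoding("BINARY").length  # 8
```

This is also why the "US-ASCII" tag seen earlier matters: the count is only right when the tag matches the real encoding of the bytes.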

Une Bévue · Mar 11, 2010

Brian Candler said:
Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

Perfectly right, because i read on a MacRuby web page
(http://www.macruby.org/documentation/overview.html) :

Primitives Classes
The primitive Ruby classes (String, Array, and Hash) have been
re-implemented on top of their Cocoa equivalents (respectively,
NSString, NSArray, and NSDictionary).

As an example, String is no longer a class, but a pointer (alias) to
NSMutableString. All strings in MacRuby are genuine Cocoa strings and
can be passed (without conversion) to underlying C or Objective-C APIs
that expect Cocoa strings.

The whole String interface was re-implemented on top of NSString. This
means that you can call any method of String on any Cocoa string.
Because Cocoa strings can be either mutable and immutable, if you try to
call a method that is supposed to modify its receiver on an immutable
string, a runtime exception will be raised.

Une Bévue · Mar 11, 2010

Une Bévue said:
Perfectly right, because i read on a MacRuby web page
(http://www.macruby.org/documentation/overview.html) :

Then i've installed ruby 1.9 :
zsh-% ruby1.9 -v
ruby 1.9.1p376 (2009-12-07 revision 26041) [i386-darwin10]

and, when using it without forcing the encoding, I get the right chars...

So I'm sure the problem comes from MacRuby.

Une Bévue · Mar 11, 2010

Brian Candler said:
Perhaps the version of 1.9.0 which macruby forked from was different in
this regard though.

Yes, I got this answer from the MacRuby list:
1.9 encodings in trunk have very little support for now, but we
significantly improved them in a branch that might get merged into trunk
in a few days (maybe today). I will post an update here once it's done.
