[ruby 1.9] reading an UTF-8 encoded file

Discussion in 'Ruby' started by Une Bévue, Mar 10, 2010.

  1. Une Bévue

    Une Bévue Guest

    If I read a UTF-8 encoded file and output it to the terminal, I do not
    get the same result with ruby 1.8.x and ruby 1.9.

    With 1.8 I get "é" correctly; with 1.9 I get it wrong ("é") even if I
    specify the encoding with:
    open(__FILE__, "r:UTF-8") do ...

    What did I misunderstand?
     
    Une Bévue, Mar 10, 2010
    #1

  2. In my web browser on ruby-forum, the symbol you call "correct" above
    shows up as invalid, and the one you call "wrong" is valid.

    Are you in irb, or running code in a .rb file? Are you using "puts" or
    are you looking at the string values as returned by irb, after the =>
    prompt?

    In either case, show your actual code. Beware that things behave
    strangely in irb with 1.9. Some of the oddities I noticed in irb are
    documented in
    http://github.com/candlerb/string19/blob/master/string19.rb
    from about line 1648.
    Remember that encodings by themselves don't actually change the sequence
    of bytes. If your code is something like this:

    open("somefile.txt") do |f|
      while line = f.gets
        puts line
      end
    end

    and you run it as a .rb script, I would expect it to work the same in
    both 1.8 and 1.9. That is, it should read lines and squirt them back out
    to stdout unchanged. No transcoding is done. If they appear wrongly, it
    would be because the encoding of the file contents is not the same as
    the encoding of your terminal.

    Furthermore, it makes no difference in 1.9 if you do this:

    open("somefile.txt","r:UTF-8") do |f|
      while line = f.gets
        puts line
      end
    end

    In ruby 1.9, all this means is that the string 'line' will be tagged as
    being UTF-8, rather than some encoding picked up from the environment.
    However by default, the same sequence of bytes will be squirted out.
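
A minimal sketch of the point above, assuming stock ruby 1.9 or later with Encoding.default_internal unset (the default). The helper names and the sample file are hypothetical, made up for illustration. It opens the same file with two different external encodings and shows that only the tag on the string changes, not the bytes:

```ruby
require "tempfile"

# Returns [encoding name, bytes] for the first line of +path+,
# opened with the given mode string (e.g. "r:UTF-8").
def first_line_info(path, mode)
  File.open(path, mode) do |f|
    line = f.gets
    [line.encoding.name, line.bytes.to_a]
  end
end

# Hypothetical sample data: "vérité" written as raw UTF-8 bytes.
def with_sample_file
  Tempfile.create("sample") do |tmp|
    tmp.binmode                       # write the bytes untouched
    tmp.write("v\u00e9rit\u00e9\n")
    tmp.flush
    yield tmp.path
  end
end

with_sample_file do |path|
  utf8  = first_line_info(path, "r:UTF-8")
  latin = first_line_info(path, "r:ISO-8859-1")
  p utf8                              # tagged UTF-8
  p latin                             # tagged ISO-8859-1
  puts "same bytes: #{utf8[1] == latin[1]}"
end
```

Whether the terminal then shows the right characters depends only on whether those bytes match what the terminal expects.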

    However in 1.9 you *can* cause the string to be transcoded, if:

    (1) you specify a different internal and external encoding when reading
    the data (so it gets transcoded on input); or

    (2) you specify an external encoding when writing the data (so it gets
    transcoded on output)
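
The two cases above can be sketched as follows. This is an illustration under assumptions, not anyone's actual script: the helper names and the one-byte sample file are hypothetical, and it targets stock ruby 1.9 or later.

```ruby
require "tempfile"

# Hypothetical helper: writes raw +bytes+ to a temp file and yields its path.
def with_bytes(bytes)
  Tempfile.create("enc") do |tmp|
    tmp.binmode
    tmp.write(bytes.pack("C*"))
    tmp.flush
    yield tmp.path
  end
end

# (1) Transcode on input: external encoding ISO-8859-1, internal UTF-8.
def read_latin1_as_utf8(path)
  File.open(path, "r:ISO-8859-1:UTF-8") { |f| f.read }
end

# (2) Transcode on output: the string is converted to ISO-8859-1 when written.
# Returns the bytes that actually ended up in the file.
def write_as_latin1(path, string)
  File.open(path, "w:ISO-8859-1") { |f| f.write(string) }
  File.binread(path).bytes.to_a
end

with_bytes([0xE9]) do |path|          # 0xE9 is "é" in ISO-8859-1
  s = read_latin1_as_utf8(path)
  p [s.encoding.name, s.bytes.to_a]   # now the two-byte UTF-8 form
  p write_as_latin1(path, "\u00e9")   # back to a single Latin-1 byte
end
```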

    HTH,

    Brian
     
    Brian Candler, Mar 10, 2010
    #2

  3. Une Bévue

    Une Bévue Guest

    first, thanks for your reply )))

    I'm not using irb but rather an .rb file launched from Terminal.

    here is the code (ruby 1.9) :

    ------------------------------------------------------------------------
    #! /usr/local/bin/macruby
    # encoding: utf-8

    SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

    open(SIGNATURES_FILE, "r:UTF-8") do |file|
      p file.internal_encoding
      file.each do |line|
        p [line.encoding.name, line]
      end
    end
    open(SIGNATURES_FILE) do |file|
      p file.internal_encoding
      file.each do |line|
        p [line.encoding.name, line]
      end
    end
    ------------------------------------------------------------------------

    resulting in :
    zsh-% ./essai_macruby.rb
    nil
    ["US-ASCII", "-- \n"]
    ["US-ASCII", "« Un banquier est toujours en liberté provisoire » \n"]
    ["US-ASCII", "(Henri Poincaré )\n"]

    ....

    ["US-ASCII", "la minute de vérité risque de se faire longtemps
    attendre. » \n"]
    ["US-ASCII", "(Pierre Dac)\n"]
    nil
    ["US-ASCII", "-- \n"]
    ["US-ASCII", "« Un banquier est toujours en liberté provisoire » \n"]
    ["US-ASCII", "(Henri Poincaré )\n"]
    [
    ....

    ["US-ASCII", "la minute de vérité risque de se faire longtemps
    attendre. » \n"]
    ["US-ASCII", "(Pierre Dac)\n"]
    zsh-%


    So both methods (with and without "r:UTF-8") tag the lines as US-ASCII,
    although the file is really UTF-8 encoded.

    now the "equivalent" test using ruby 1.8.* :

    ------------------------------------------------------------------------
    #! /usr/bin/env ruby

    SIGNATURES_FILE = "/Users/yt/dev/Signature/signatures.txt"

    open(SIGNATURES_FILE) do |file|
      file.each do |line|
        puts line
      end
    end
    ------------------------------------------------------------------------

    run from Term :

    zsh-% ./essai.rb
    --
    « Un banquier est toujours en liberté provisoire »
    (Henri Poincaré )

    ....

    « Pour ceux qui vont chercher midi à quatorze heures,
    la minute de vérité risque de se faire longtemps attendre. »
    (Pierre Dac)


    Accented chars are correct now. Notice I have to use "puts" instead of
    "p" to get the chars; otherwise I get the UTF-8 bytes as octal escapes,
    e.g. "v\303\251rit\303\251".
     
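
To make the "p" versus "puts" difference concrete, here is a small sketch for stock ruby 1.9 or later. The octal_escape helper is hypothetical: it only emulates the way 1.8 displayed high bytes and is not 1.8's actual implementation.

```ruby
# Formats each byte >= 128 as a three-digit octal escape, roughly the way
# ruby 1.8's String#inspect displayed them (an emulation for illustration).
def octal_escape(str)
  str.bytes.map { |b| b < 128 ? b.chr : format("\\%03o", b) }.join
end

word = "v\u00e9rit\u00e9"                       # "vérité" as a UTF-8 string

puts octal_escape(word)                         # escaped form, as 1.8's "p" showed it
puts word.dup.force_encoding("BINARY").inspect  # 1.9+: binary-tagged bytes are hex-escaped
puts word.bytesize                              # the underlying bytes are the same either way
```

In both cases the string holds the same eight bytes; only the display produced by inspect differs.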
    Une Bévue, Mar 11, 2010
    #3
  4. Use 'puts' instead of 'p' and it may work. That is, I suspect
    String#inspect is doing some mangling.

    You really should look at your postings in ruby-forum:
    http://www.ruby-forum.com/topic/205792

    Wherever you say ruby 1.9 is giving the 'wrong' output it is correct,
    and where you say ruby 1.8 is giving the 'right' output it is wrong. I
    have a suspicion that there is a mismatch between the file content and
    the terminal.

    What if you just type "cat /Users/yt/dev/Signature/signatures.txt" at
    the terminal?
    Yes, String#inspect in ruby 1.8 will mangle all byte values of 128 and
    above into escaped form. String#inspect in ruby 1.9 behaves differently,
    and doesn't always mangle them.

    However, I just noticed 'macruby' in your scripts. Are you actually
    running MacRuby, or genuine Matz's Ruby Interpreter (MRI) 1.9? If it's
    MacRuby, all bets are off - I thought it was a completely different interpreter
    written from scratch. I have no Mac here to compare behaviour with, and
    I have no idea what variation of 1.9 encoding rules MacRuby has
    implemented.

    In particular, I'm surprised that your program sees strings tagged as
    "US-ASCII" rather than "UTF-8" when you explicitly opened the file with
    external encoding of UTF-8. This makes me very suspicious of your actual
    ruby platform.

    Try adding this line to your code to get info about the Ruby platform:
    p Object.constants.grep(/RUBY/).map { |n| [n, Object.const_get(n)] }

    Regards,

    Brian.

    P.S. For comparison, here's what I get with an oldish ruby pre-1.9.2
    under Linux. Try these on your system.
    => "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"
     
    Brian Candler, Mar 11, 2010
    #4
  5. Une Bévue

    Une Bévue Guest

    i got the correct chars :
    zsh-% cat /Users/yt/dev/Signature/signatures.txt
    --
    « Un banquier est toujours en liberté provisoire »
    (Henri Poincaré )

    ....
    --
    « Pour ceux qui vont chercher midi à quatorze heures,
    la minute de vérité risque de se faire longtemps attendre. »
    (Pierre Dac)
    Right now, that is to say using puts in place of p, I get the right
    chars.
    But those strings are still tagged as "US-ASCII"...
    it seems to be an "old" Ruby 1.9.0 :

    [[:RUBY_VERSION, "1.9.0"], [:RUBY_RELEASE_DATE, "2008-06-03"],
    [:RUBY_PLATFORM, "universal-darwin10.0"], [:RUBY_PATCHLEVEL, 0],
    [:RUBY_REVISION, 0], [:RUBY_DESCRIPTION, "MacRuby version 0.5 (ruby
    1.9.0) [universal-darwin10.0, x86_64]"], [:RUBY_COPYRIGHT, "MacRuby -
    Copyright (C) 2007-2008 Apple Inc."], [:RUBY_ENGINE, "macruby"],
    [:RUBY_ARCH, "x86_64"], [:MACRUBY_VERSION, "0.5"], [:MACRUBY_REVISION,
    "svn revision 3380 from
    http://svn.macosforge.org/repository/ruby/MacRuby/branches/0.5"]]

    thanks again !
     
    Une Bévue, Mar 11, 2010
    #5
  6. Ah, so it's not ruby 1.9, it's macruby 0.5. I'm afraid I'll have to
    defer to the Mac experts here.

    I see there that macruby has its own mailing lists:
    http://www.macruby.org/contact-us.html
     
    Brian Candler, Mar 11, 2010
    #6
  7. Une Bévue

    Une Bévue Guest

    However, when doing some splitting with:

    def get_signatures
      t = "".force_encoding("UTF-8")
      open(SIGNATURES_FILE, "r:UTF-8") do |file|
      #open(SIGNATURES_FILE) do |file|
        file.each do |line|
          t += line.force_encoding("UTF-8")
        end
      end
      #File.open(SIGNATURES_FILE, "r:UTF-8").each {|l| t += l }
      return t.split(NEEDLE)
    end

    (notice i've forced the encoding)

    signatures = get_signatures
    c = signatures.count
    puts "Nombre de signatures : #{c}"

    r = rand(c)
    puts "Signature aléatoire (n° #{r}) :"
    signature = NEEDLE + signatures[r]
    puts signature

    the output is wrong in that case ???

    Nombre de signatures : 29
    Signature aléatoire (n° 1) :
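
For reference, a sketch of what force_encoding does in stock ruby 1.9 or later (MacRuby 0.5 may behave differently). The retag and transcode helpers are hypothetical names introduced only for this illustration. force_encoding changes the tag and leaves the bytes alone, while encode produces new bytes:

```ruby
# Retag only: same bytes, new label. The result may be invalid in the new encoding.
def retag(str, enc)
  str.dup.force_encoding(enc)
end

# Transcode: new bytes that represent the same characters in +enc+.
def transcode(str, enc)
  str.encode(enc)
end

# "é" as a single ISO-8859-1 byte
latin1 = [0xE9].pack("C").force_encoding("ISO-8859-1")

forced = retag(latin1, "UTF-8")
p [forced.bytes.to_a, forced.valid_encoding?]        # bytes untouched, not valid UTF-8

converted = transcode(latin1, "UTF-8")
p [converted.bytes.to_a, converted.valid_encoding?]  # two-byte UTF-8 form, valid
```

So forcing "UTF-8" onto data that already is UTF-8 should be harmless in stock 1.9, and it never repairs data that is in some other encoding.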
     
    Une Bévue, Mar 11, 2010
    #7
  8. Une Bévue

    Une Bévue Guest

    OK, I'll ask there, even though it is based on ruby 1.9:
    [:RUBY_VERSION, "1.9.0"], [:RUBY_RELEASE_DATE, "2008-06-03"]
     
    Une Bévue, Mar 11, 2010
    #8
  9. If forcing the encoding of individual strings makes Ruby output them
    differently, then I expect that STDOUT must have an external encoding
    set.

    Try this:

    p STDOUT.external_encoding

    What do you get? If you get something other than nil, it means that puts
    will transcode characters from the tagged encoding to this encoding.

    In ruby 1.9, STDOUT.external_encoding is nil unless you set it
    explicitly.
    => "ruby 1.9.2dev (2009-07-18 trunk 24186) [i686-linux]"

    Perhaps the version of 1.9.0 which macruby forked from was different in
    this regard though.
     
    Brian Candler, Mar 11, 2010
    #9
  10. Une Bévue

    Une Bévue Guest

    I got nil ))
    Yes, right, I'll ask the MacRuby list.

    In fact, I also have a built-in ruby 1.8.x, but I'd rather use 1.9
    because I have to count UTF-8 chars and I know this is built into
    ruby 1.9. And because I might design a UI, it's better to use MacRuby,
    since it is written on top of Obj-C and Cocoa.
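
A small sketch of the character counting mentioned above, for stock ruby 1.9 or later. The helper name is hypothetical. String#length counts characters according to the string's encoding tag, while String#bytesize counts bytes:

```ruby
# Returns [number of characters, number of bytes] for +str+.
def char_and_byte_counts(str)
  [str.length, str.bytesize]
end

word = "v\u00e9rit\u00e9"                        # "vérité", tagged UTF-8
p char_and_byte_counts(word)                     # => [6, 8]

# The same bytes tagged as binary: every byte counts as one character.
p char_and_byte_counts(word.dup.force_encoding("BINARY"))  # => [8, 8]
```

This is why a wrong encoding tag matters even when the output looks right: the character count follows the tag.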
     
    Une Bévue, Mar 11, 2010
    #10
  11. Une Bévue

    Une Bévue Guest

    Perfectly right, because i read on a MacRuby web page
    (http://www.macruby.org/documentation/overview.html) :

    Primitives Classes
    The primitive Ruby classes (String, Array, and Hash) have been
    re-implemented on top of their Cocoa equivalents (respectively,
    NSString, NSArray, and NSDictionary).

    As an example, String is no longer a class, but a pointer (alias) to
    NSMutableString. All strings in MacRuby are genuine Cocoa strings and
    can be passed (without conversion) to underlying C or Objective-C APIs
    that expect Cocoa strings.

    The whole String interface was re-implemented on top of NSString. This
    means that you can call any method of String on any Cocoa string.
    Because Cocoa strings can be either mutable and immutable, if you try to
    call a method that is supposed to modify its receiver on an immutable
    string, a runtime exception will be raised.
     
    Une Bévue, Mar 11, 2010
    #11
  12. Une Bévue

    Une Bévue Guest

    Then i've installed ruby 1.9 :
    zsh-% ruby1.9 -v
    ruby 1.9.1p376 (2009-12-07 revision 26041) [i386-darwin10]

    and, when using it without forcing the encoding i get the right chars...

    then i'm sure the prob comes from MacRuby.
     
    Une Bévue, Mar 11, 2010
    #12
  13. Une Bévue

    Une Bévue Guest

    Yes, I got an answer from the MacRuby list:
    1.9 encodings in trunk have very little support for now, but we
    significantly improved them in a branch that might get merged into trunk
    in a few days (maybe today). I will post an update here once it's done.
     
    Une Bévue, Mar 11, 2010
    #13