ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

Discussion in 'Ruby' started by DJ Jazzy Linefeed, May 16, 2008.

  1. def prep_file(path)

    ret = ''

    x = File.open(path)

    x.lines.each do |l|
    l.gsub!('\n', ' ')
    ret << l
    end

    puts ret

    end
    ....
    compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
    from compare.rb:64:in `block in prep_file'
    from compare.rb:63:in `each_line'
    from compare.rb:63:in `call'
    from compare.rb:63:in `each'
    from compare.rb:63:in `prep_file'
    from compare.rb:144:in `<main>'

    Hm. Okay, I love you ruby, we can just talk this thing out and I can
    get back to...

    x.lines.each do |l|
    ret << l
    end

    # (I love you too)

    Alright baby, Daddy gets confused and angry sometimes... do you wanna
    make a little string love...?

    ret = ''

    x = File.open(path)

    x.lines.each do |l|
    ret << l
    end

    puts ret.class

    # String

    Mhm, it smells like you do. Why don't we take this off...

    x.lines.each do |l|
    ret << l
    end

    puts ret.gsub!('a', 'test')

    end
    ....
    compare.rb:69:in `gsub!': broken UTF-8 string (ArgumentError)
    from compare.rb:69:in `prep_file'
    from compare.rb:145:in `<main>'

    Hey, Ruby, if it's that week of the month we can just cuddle. Here,
    try this...


    x.lines.each do |l|
    ret << l
    end

    puts ret
    ....
    # (big string)

    See, that's good. That's a string and that's something we have in
    common, maybe we were just talking about different encodings. Let's
    see what it's made of.

    puts ret.encoding

    # UTF-8
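
A note on that last result: the encoding tag only says how Ruby will interpret the bytes, not whether the bytes actually fit. A minimal sketch, assuming a made-up byte string in place of the poster's file (the real file contents are not shown in the thread):

```ruby
# Made-up stand-in for the file contents: "café" as Latin-1 bytes,
# but tagged UTF-8 the way a UTF-8 locale would tag it.
ret = "caf\xE9".b.force_encoding("UTF-8")

tag   = ret.encoding.name    # the label Ruby attached to the string
valid = ret.valid_encoding?  # whether the bytes actually fit that label

puts tag
puts valid
```

If `valid_encoding?` comes back false, the string is "broken" in the sense of the error message above, even though `encoding` happily reports UTF-8.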

    I'm gonna go get a gallon of milk and I'll be back soon. You wait
    right there. (grumbles)
    DJ Jazzy Linefeed, May 16, 2008
    #1

  2. DJ Jazzy Linefeed

    7stud -- Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    DJ Jazzy Linefeed wrote:
    >
    > I'm gonna go get a gallon of milk and I'll be back soon. You wait
    > right there. (grumbles)
    >


    Just shut your eyes and hum the mantra, "Ruby doesn't get in your way.
    Ruby doesn't get in your way." You obviously need to get "cleared".
    Please set up an appointment with your nearest Church of Scientology.

    --
    Posted via http://www.ruby-forum.com/.
    7stud --, May 20, 2008
    #2

  3. DJ Jazzy Linefeed

    Todd Benson Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    On Fri, May 16, 2008 at 4:10 PM, DJ Jazzy Linefeed
    <> wrote:
    > def prep_file(path)
    >
    > ret = ''
    >
    > x = File.open(path)
    >
    > x.lines.each do |l|
    > l.gsub!('\n', ' ')
    > ret << l
    > end
    >
    > puts ret
    >
    > end
    > ...
    > compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
    > from compare.rb:64:in `block in prep_file'
    > from compare.rb:63:in `each_line'
    > from compare.rb:63:in `call'
    > from compare.rb:63:in `each'
    > from compare.rb:63:in `prep_file'
    > from compare.rb:144:in `<main>'
    >
    > Hm. Okay, I love you ruby, we can just talk this thing out and I can
    > get back to...
    >
    > x.lines.each do |l|
    > ret << l
    > end
    >
    > # (I love you too)
    >
    > Alright baby, Daddy gets confused and angry sometimes... do you wanna
    > make a little string love...?
    >
    > ret = ''
    >
    > x = File.open(path)
    >
    > x.lines.each do |l|
    > ret << l
    > end
    >
    > puts ret.class
    >
    > # String
    >
    > Mhm, it smells like you do. Why don't we take this off...
    >
    > x.lines.each do |l|
    > ret << l
    > end
    >
    > puts ret.gsub!('a', 'test')
    >
    > end
    > ...
    > compare.rb:69:in `gsub!': broken UTF-8 string (ArgumentError)
    > from compare.rb:69:in `prep_file'
    > from compare.rb:145:in `<main>'
    >
    > Hey, Ruby, if it's that week of the month we can just cuddle. Here,
    > try this...
    >
    >
    > x.lines.each do |l|
    > ret << l
    > end
    >
    > puts ret
    > ...
    > # (big string)
    >
    > See, that's good. That's a string and that's something we have in
    > common, maybe we were just talking about different encodings. Let's
    > see what it's made of.
    >
    > puts ret.encoding
    >
    > # UTF-8
    >
    > I'm gonna go get a gallon of milk and I'll be back soon. You wait
    > right there. (grumbles)


    I love it. Another person that wants a babel fish. The irony is in
    the language demonstrating as much. In other words, I need Jazzy
    Linefeed encoding (I left off the DJ because there might be other
    types of linefeeds).

    Todd
    Todd Benson, May 20, 2008
    #3
  4. DJ Jazzy Linefeed

    Gary Watson Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Thanks for the suggestion of using ascii-8bit. This solved my problem.

    The line of code that was giving me fits was the following line. Worked
    in 1.8 but didn't work in 1.9.0

    puts Dir["**/*"].select {|x| x.match(/(jpg)$/)}

    when I changed it to this

    puts Dir["**/*"].select {|x|
    x.force_encoding("ascii-8bit").match(/(jpg)$/)}

    all was well.
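
For anyone trying the same trick, here is a small sketch of it that can be run without a directory full of oddly named files. The `names` list and the `jpg_names` helper are hypothetical stand-ins for `Dir["**/*"]`, and the match is done on a copy so the original string keeps its tag:

```ruby
# Hypothetical file list; the second name holds a Latin-1 byte and is
# therefore not valid UTF-8, which is how a UTF-8 locale would tag it.
names = ["photo.jpg", "caf\xE9.jpg".b.force_encoding("UTF-8"), "notes.txt"]

# Match on a binary-tagged copy so the regexp engine only sees bytes.
def jpg_names(names)
  names.select { |n| n.dup.force_encoding("ASCII-8BIT").match(/jpg$/) }
end

puts jpg_names(names).length
```

The `dup` is a design choice, not a requirement: calling `force_encoding` directly, as in the line above, retags the string in place.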

    Regards,
    Gary


    Yukihiro Matsumoto wrote:
    > Hi,
    >
    > In message "Re: ruby 1.9 hates you and me and the encodings we rode in
    > on so just get used to it."
    > on Sat, 17 May 2008 06:10:05 +0900, DJ Jazzy Linefeed
    > <> writes:
    > |
    > |def prep_file(path)
    > |
    > | ret = ''
    > |
    > | x = File.open(path)
    > |
    > | x.lines.each do |l|
    > | l.gsub!('\n', ' ')
    > | ret << l
    > | end
    > |
    > | puts ret
    > |
    > |end
    > |...
    > |compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
    > | from compare.rb:64:in `block in prep_file'
    > | from compare.rb:63:in `each_line'
    > | from compare.rb:63:in `call'
    > | from compare.rb:63:in `each'
    > | from compare.rb:63:in `prep_file'
    > | from compare.rb:144:in `<main>'
    >
    > Regular expression operation does not work fine on broken strings. It
    > seems that you specify utf-8 for your locale, yet the content of
    > reading file is not. If you know the encoding of the content, say
    > iso-8859-1, you can open it with the explicit encoding:
    >
    > x = File.open(path, "r:iso-8859-1")
    >
    > if not, you can say it
    >
    > x = File.open(path, "r:ascii-8bit")
    >
    > unless the file content is non ASCII like UTF-16.
    >
    > matz.
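
The advice above can be sketched roughly as follows. This is only an illustration: it assumes the file really is ISO-8859-1 (the thread never says), the `demo_prep_file` helper and its sample bytes are made up, and it transcodes to UTF-8 on the way in rather than keeping the Latin-1 tag.

```ruby
require "tempfile"

# Hypothetical rewrite of prep_file: decode as `external`, hand back UTF-8.
def prep_file(path, external = "ISO-8859-1")
  text = File.read(path, encoding: "#{external}:UTF-8")
  text.gsub("\n", " ") # double quotes: "\n" is a newline, '\n' is backslash + n
end

# Demo with a throwaway file holding the Latin-1 bytes for "café".
def demo_prep_file
  Tempfile.create("latin1") do |f|
    f.binmode
    f.write("caf\xE9\nbar\n".b)
    f.flush
    prep_file(f.path)
  end
end

puts demo_prep_file
```

Note that the original `gsub!('\n', ' ')` uses single quotes, so even with a valid string it would look for a literal backslash followed by `n` rather than a newline.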


    --
    Posted via http://www.ruby-forum.com/.
    Gary Watson, Dec 27, 2009
    #4
  5. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Yukihiro Matsumoto wrote:
    > on Sat, 17 May 2008 06:10:05 +0900, DJ Jazzy Linefeed
    > <> writes:
    > > l.gsub!('\n', ' ')

    [snip]
    > Regular expression operation does not work fine on broken strings. It


    An off-topic question:

    So String#gsub always uses the regexp engine (even if the pattern is a
    plain string).

    Now,

    Is there a way, in Ruby, to do search/replace that doesn't involve the
    regexp engine?

    I'm asking this because I figure that not using the regexp engine would
    be faster (but maybe it'll be only marginally faster, I don't know).

    I know one can do...

    s[ 'find' ] = 'replace'

    ...but it replaces only one occurrence of the substring (and does it
    skip the regexp engine?).
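
One way to do this without handing a pattern to gsub is a plain loop over String#index with a String argument. The `replace_all` helper below is hypothetical, and this is only a sketch: whether it is any faster than gsub is not something it demonstrates, and that would need measuring.

```ruby
# Hypothetical helper: replace every occurrence of `find` in `str`
# using only String#index and slicing.
def replace_all(str, find, replace)
  return str.dup if find.empty?

  out = +""
  pos = 0
  while (hit = str.index(find, pos))
    out << str[pos...hit] << replace   # copy the gap, then the replacement
    pos = hit + find.length            # continue after the match
  end
  out << str[pos..-1]                  # copy whatever is left
end

puts replace_all("a.b.c", ".", "-")
```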
    --
    Posted via http://www.ruby-forum.com/.
    Albert Schlef, Dec 27, 2009
    #5
  6. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    DJ Jazzy Linefeed wrote:
    > compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)


    Yep. Ruby 1.9 will raise exceptions in all sorts of odd places,
    dependent on both the tagged encoding of the string *and* its content at
    that point in time.

    I got as far as recording 200 behaviours of String in ruby 1.9 before I
    gave up:
    http://github.com/candlerb/string19/blob/master/string19.rb

    The solution I use is simple: stick to ruby 1.8.x. When that branch
    dies, perhaps reia will be ready. If not I'll move to something else.

    IMO, both python 3 and erlang have got the right idea when it comes to
    handling UTF8.
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Dec 27, 2009
    #6
  7. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    2009/12/27 Brian Candler <>

    > DJ Jazzy Linefeed wrote:
    > > compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)

    >
    > Yep. Ruby 1.9 will raise exceptions in all sorts of odd places,
    > dependent on both the tagged encoding of the string *and* its content at
    > that point in time.
    >
    > I got as far as recording 200 behaviours of String in ruby 1.9 before I
    > gave up:
    > http://github.com/candlerb/string19/blob/master/string19.rb
    >
    > The solution I use is simple: stick to ruby 1.8.x. When that branch
    > dies, perhaps reia will be ready. If not I'll move to something else.
    >
    > IMO, both python 3 and erlang have got the right idea when it comes to
    > handling UTF8.
    > --
    > Posted via http://www.ruby-forum.com/.
    >
    >

    Hi,

    I got this kind of problem yesterday too.

    While taking some file names with Dir#[], I got some special results.

    I was searching for "bad" file names, I mean file names with é, ê or
    whatever. When I print the String given in the block directly, no problem.

    But then I come with things like:
    /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng

    (The ~ is separated from the n and then is not ñ). The Regexp is acting like
    it is 2 different characters. How to handle that easily? I tried to change
    the script encoding in MacRoman, but it produced an error of bad encoding
    not matching UTF-8.

    as output of this script (which is then not able to rename any wrong file,
    because tr! seem to not work either on name) :

    path = ARGV[0] || "/"

    ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"

    Dir["#{File.expand_path(path)}/**/*"].each { |f|
      name = File.basename(f)
      unless name =~ /^[#{ALLOWED_CHARS}]+$/
        puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/, ">\\1<")

        # Here it is not complete, it is just a test, but it doesn't work
        # even for 'filéname'
        if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/
          File.rename(f, File.dirname(f) + '/' + name)
          puts "\trenamed in #{name}"
          break
        end
      end
    }
    Benoit Daloze, Dec 27, 2009
    #7
  8. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Brian Candler wrote:
    > DJ Jazzy Linefeed wrote:
    >
    >> compare.rb:64:in `gsub': broken UTF-8 string (ArgumentError)
    >>

    >
    > Yep. Ruby 1.9 will raise exceptions in all sorts of odd places,
    > dependent on both the tagged encoding of the string *and* its content at
    > that point in time.
    >


    If you don't arbitrarily set the encoding, when will this be a problem?

    Edward
    Edward Middleton, Dec 28, 2009
    #8
  9. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Benoit Daloze wrote:
    > But then I come with things like:
    > /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
    >
    > (The ~ is separated from the n and then is not ñ). The Regexp is acting
    > like
    > it is 2 different characters. How to handle that easily? I tried to
    > change
    > the script encoding in MacRoman, but it produced an error of bad
    > encoding
    > not matching UTF-8.


    I don't know what you mean. If Dir.[] tells you that the file name is
    <e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
    filename?

    I suggest you try something like this:

    puts "Source encoding: #{"".encoding}"
    puts "External encoding: #{Encoding.default_external}"
    Dir["*.lng"].each do |fn|
    puts "Name: #{fn.inspect}"
    puts "Encoding: #{fn.encoding}"
    puts "Chars: #{fn.chars.to_a.inspect}"
    puts "Codepoints: #{fn.codepoints.to_a.inspect}"
    puts "Bytes: #{fn.bytes.to_a.inspect}"
    puts
    end

    then post the results for this file here. Then also post what you think
    the true filename is.

    Then you can see whether: (1) Dir.[] is returning the correct sequence
    of bytes for the filename or not; and (2) Dir.[] is tagging the string
    with the correct encoding or not.

    (This is one of the thousands of cases I did *not* document in
    string19.rb; I did some of the core methods on String, but of course
    every method in every class which either returns a string or accepts a
    string argument needs to document how it handles encodings)

    > as output of this script (which is then not able to rename any wrong
    > file,
    > because tr! seem to not work either on name) :
    >
    > path = ARGV[0] || "/"
    >
    > ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
    >
    > Dir["#{File.expand_path(path)}/**/*"].each { |f|
    > name = File.basename(f)
    > unless name =~ /^[#{ALLOWED_CHARS}]+$/
    > puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/,
    > ">\\1<")
    >
    > if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/ # Here it is
    > not
    > complete, it is just a test, but it doesn't work even for 'filéname'
    > File.rename(f, File.dirname(f) + '/' + name)
    > puts "\trenamed in #{name}"
    > break
    > end
    > end
    > }


    What error do you get? Is it failing to match the é at all (tr! returns
    nil), or is an encoding error raised in tr!, or is an error raised by
    File.rename ?
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Dec 28, 2009
    #9
  10. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    2009/12/28 Brian Candler <>

    > Benoit Daloze wrote:
    > > But then I come with things like:
    > > /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
    > >
    > > (The ~ is separated from the n and then is not ñ). The Regexp is acting
    > > like
    > > it is 2 different characters. How to handle that easily? I tried to
    > > change
    > > the script encoding in MacRoman, but it produced an error of bad
    > > encoding
    > > not matching UTF-8.

    >
    > I don't know what you mean. If Dir.[] tells you that the file name is
    > <e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
    > filename?
    >
    > I suggest you try something like this:
    >
    > puts "Source encoding: #{"".encoding}"
    > puts "External encoding: #{Encoding.default_external}"
    > Dir["*.lng"].each do |fn|
    > puts "Name: #{fn.inspect}"
    > puts "Encoding: #{fn.encoding}"
    > puts "Chars: #{fn.chars.to_a.inspect}"
    > puts "Codepoints: #{fn.codepoints.to_a.inspect}"
    > puts "Bytes: #{fn.bytes.to_a.inspect}"
    > puts
    > end
    >
    > then post the results for this file here. Then also post what you think
    > the true filename is.
    >


    The true filename is (from the Finder and Terminal):
    -rw-r--r--@ 1 benoitdaloze staff 3758 Jul 17 2008 español.lng
    So, with the 'ñ'.

    I don't know which is the encoding of the filename on HFS+. From Wikipedia
    it is said to be UTF-16, with decomposition:
    "names which are also character encoded in UTF-16
    <http://en.wikipedia.org/wiki/UTF-16> and normalized to a form very nearly
    the same as Unicode Normalization Form D (NFD)
    <http://en.wikipedia.org/wiki/Unicode_normalization>
    [4] <http://en.wikipedia.org/wiki/HFS_Plus#cite_note-3> (which means that
    precomposed characters like é are decomposed in the HFS+ filename and
    therefore count as two characters
    [5] <http://en.wikipedia.org/wiki/HFS_Plus#cite_note-4>)"
    So, that's probably a problem of encoding for Dir.[]

    I changed a little the script, to compare with a String hard-coded inside
    the script (rn = "español.lng")

    ruby 1.9.2dev (2009-12-11 trunk 26067) [x86_64-darwin10.2.0]

    Source encoding: UTF-8
    External encoding: UTF-8

    Format:
    String in the code
    filename from Dir[]

    String equality: false

    Name:
    "español.lng"
    "español.lng"
    Encoding:
    UTF-8
    UTF-8
    Chars:
    ["e", "s", "p", "a", "ñ", "o", "l", ".", "l", "n", "g"]
    ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
    Codepoints:
    [101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
    [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
    Bytes:
    [101, 115, 112, 97, 195, 177, 111, 108, 46, 108, 110, 103]
    [101, 115, 112, 97, 110, 204, 131, 111, 108, 46, 108, 110, 103]


    > Then you can see whether: (1) Dir.[] is returning the correct sequence
    > of bytes for the filename or not; and (2) Dir.[] is tagging the string
    > with the correct encoding or not.
    >


    (1) Dir[] seems to return a correct String in UTF-8, while being different
    (!!) from a String inside in UTF-8
    But looking at the codepoints and bytes, it's very different ...

    (2) That's probably the case, let's look by forcing the encoding to
    MacRoman:
    Or not ... making crazy results like: "espan\xCC\x83ol.lng" or
    "espan\u0303ol.lng"

    Well, this is out of my poor knowledge of encoding I'm afraid :(

    The most frustrating is it's printing the same...

    P.S.: Well I got also filenames with "\r", quite weird, no? ("Target
    Application Alias\r", and the "\r" is shown as "?" in the Terminal)

    > (This is one of the thousands of cases I did *not* document in
    > string19.rb; I did some of the core methods on String, but of course
    > every method in every class which either returns a string or accepts a
    > string argument needs to document how it handles encodings)
    >
    > > as output of this script (which is then not able to rename any wrong
    > > file,
    > > because tr! seem to not work either on name) :
    > >
    > > path = ARGV[0] || "/"
    > >
    > > ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
    > >
    > > Dir["#{File.expand_path(path)}/**/*"].each { |f|
    > > name = File.basename(f)
    > > unless name =~ /^[#{ALLOWED_CHARS}]+$/
    > > puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/,
    > > ">\\1<")
    > >
    > > if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/ # Here it is
    > > not
    > > complete, it is just a test, but it doesn't work even for 'filéname'
    > > File.rename(f, File.dirname(f) + '/' + name)
    > > puts "\trenamed in #{name}"
    > > break
    > > end
    > > end
    > > }

    >
    > What error do you get? Is it failing to match the é at all (tr! returns
    > nil), or is an encoding error raised in tr!, or is an error raised by
    > File.rename ?
    > --
    > Posted via http://www.ruby-forum.com/.
    >

    Yes, tr! returns nil on name.tr!('ñ', 'n'), but it would work on a String
    inside the script (eg: "eño".tr!('ñ', 'n'))
    Benoit Daloze, Dec 28, 2009
    #10
  11. Re: ruby 1.9 hates you and me and the encodings we rode in o

    Benoit Daloze wrote:
    > 2009/12/27 Brian Candler <>
    >
    >>
    >> The solution I use is simple: stick to ruby 1.8.x. When that branch
    >> dies, perhaps reia will be ready. If not I'll move to something else.
    >>
    >> IMO, both python 3 and erlang have got the right idea when it comes to
    >> handling UTF8.
    >> --
    >> Posted via http://www.ruby-forum.com/.
    >>
    >>

    > Hi,
    >
    > I got this kind of problem yesterday too.
    >
    > While taking some file names with Dir#[], I got some special results.
    >
    > I was searching for "bad" file names, I mean file names with é,ê or
    > whatever. When I print the String given in the block directly, no
    > problem.
    >
    > But then I come with things like:
    > /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
    >
    > (The ~ is separated from the n and then is not ñ). The Regexp is acting
    > like
    > it is 2 different characters.


    And so it is. If memory serves, Mac OS X stores filenames in normal
    form D.

    > How to handle that easily?


    Normalize to normal form C instead.
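
A minimal sketch of what that looks like, assuming a Ruby new enough to have String#unicode_normalize (it did not exist in the 1.9 series discussed in this thread). The sample name is the decomposed form shown earlier:

```ruby
# The name as Dir[] reported it in this thread: "n" followed by U+0303.
decomposed = "espan\u0303ol.lng"

# Recompose to NFC so that n + combining tilde becomes the single character ñ.
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.codepoints.length
puts composed.codepoints.length
```

After this step a character class or `tr` that mentions ñ sees one codepoint instead of two.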

    Best,
    --
    Marnen Laibow-Koser
    http://www.marnen.org

    --
    Posted via http://www.ruby-forum.com/.
    Marnen Laibow-Koser, Dec 28, 2009
    #11
  12. DJ Jazzy Linefeed

    Bill Kelly Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Brian Candler wrote:
    >
    > I got as far as recording 200 behaviours of String in ruby 1.9 before I
    > gave up:
    > http://github.com/candlerb/string19/blob/master/string19.rb
    >
    > The solution I use is simple: stick to ruby 1.8.x. When that branch
    > dies, perhaps reia will be ready. If not I'll move to something else.
    >
    > IMO, both python 3 and erlang have got the right idea when it comes to
    > handling UTF8.


    Could you summarize what you feel the key difference of
    the python 3 / erlang approach is, compared to ruby19 ?

    I'm a relative newbie in dealing with character encodings,
    but I do recall a few lengthy discussions on this list when
    ruby19's M17N was being developed, where the "UTF-8 only"
    approaches of some other languages were deemed insufficient
    for various reasons.

    However, my understanding is that one is supposed to be
    able to effectively make ruby behave as a "UTF-8 only"
    language if one makes sure external data is transcoded to
    UTF-8 at I/O boundaries.

    I realize there may be some caveats with regard to locale,
    although I invoke my ruby19 scripts with -EUTF-8:UTF-8.

    So far, my experience with ruby19 M17N has _not_ been
    problematic. The only difficulties I've encountered have
    been when dealing with external data in some unknown
    encoding, where I've had to do some programmatic guesswork
    and finagling to make sort of a best-effort conversion of
    the external data to UTF-8 at the I/O boundary.
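
As an illustration of that kind of best-effort conversion, here is a rough sketch. The `to_utf8` helper is hypothetical, and falling back to ISO-8859-1 is purely an assumption made for the example; real data may call for a different guess.

```ruby
# Hypothetical boundary helper: accept raw bytes, return UTF-8 text.
# Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
def to_utf8(raw, fallback = "ISO-8859-1")
  utf8 = raw.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?       # already fine, just retag

  raw.dup.force_encoding(fallback).encode("UTF-8")  # transcode the guess
end

puts to_utf8("caf\xC3\xA9".b)  # bytes that were UTF-8 all along
puts to_utf8("caf\xE9".b)      # Latin-1 bytes
```

Everything past this point in the program can then assume valid UTF-8.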

    But that is something I can't imagine python or erlang
    helping me much with either.

    * * *

    Reflecting some more, I do recall that James Gray had
    remarked on the difficulty of modifying one of his libraries
    so that it would be effectively encoding agnostic, and be
    able to handle data in whatever encoding was thrown at it.

    So from that perspective I can see how a "UTF-8 only"
    approach at the language level should simplify things.

    But from my current perspective as an application developer
    who is taking the approach of ensuring all data read into
    my program is converted to UTF-8, I'm wondering if my
    experience is essentially similar to what it would be in
    a "UTF-8 only" language.


    Regards,

    Bill
    Bill Kelly, Dec 29, 2009
    #12
  13. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Bill Kelly wrote:
    > Brian Candler wrote:
    >
    >> I got as far as recording 200 behaviours of String in ruby 1.9 before I
    >> gave up:
    >> http://github.com/candlerb/string19/blob/master/string19.rb
    >>
    >> The solution I use is simple: stick to ruby 1.8.x. When that branch
    >> dies, perhaps reia will be ready. If not I'll move to something else.
    >>
    >> IMO, both python 3 and erlang have got the right idea when it comes to
    >> handling UTF8.
    >>

    >
    > Could you summarize what you feel the key difference of
    > the python 3 / erlang approach is, compared to ruby19 ?
    >


    Taking a UTF-8 approach is easier to implement because you enforce all
    strings to be UTF-8 and ignore when this doesn't work. Kind of like
    saying everything will be ASCII or converted to it ;)

    > I'm a relative newbie in dealing with character encodings,
    > but I do recall a few lengthy discussions on this list when
    > ruby19's M17N was being developed, where the "UTF-8 only"
    > approaches of some other languages were deemed insufficient
    > for various reasons.
    >


    Not everything maps one-to-one to UTF-8.

    > However, my understanding is that one is supposed to be
    > able to effectively make ruby behave as a "UTF-8 only"
    > language if one makes sure external data is transcoded to
    > UTF-8 at I/O boundaries.
    >


    That is pretty much it. The problem is that a lot of libraries still
    don't handle encodings. This results in some spurious errors when a
    function requiring compatible encoding operates on them[1]. The
    solution is to add support for handling encodings.
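
To make the "compatible encoding" point concrete, here is a small sketch with two made-up strings, one UTF-8 and one ISO-8859-1, both containing a non-ASCII character:

```ruby
utf8   = "caf\u00E9"                               # UTF-8 text
latin1 = "caf\xE9".b.force_encoding("ISO-8859-1")  # same word, Latin-1 bytes

# Both strings contain a non-ASCII character, so Ruby refuses to mix them.
mixable = !Encoding.compatible?(utf8, latin1).nil?

error = begin
  utf8 + latin1
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Transcoding one side first makes the operation well defined.
joined = utf8 + latin1.encode("UTF-8")

puts mixable
puts error
puts joined
```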

    Edward

    1. As opposed to ruby 1.8 which would silently ignore actual errors
    caused by the use of incompatible encodings.
    Edward Middleton, Dec 29, 2009
    #13
  14. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    2009/12/28 Brian Candler <>

    > Benoit Daloze wrote:
    > > But then I come with things like:
    > > /Users/benoitdaloze/Library/GlestGame/data/lang/espan>̃<ol.lng
    > >
    > > (The ~ is separated from the n and then is not ñ). The Regexp is acting
    > > like
    > > it is 2 different characters. How to handle that easily? I tried to
    > > change
    > > the script encoding in MacRoman, but it produced an error of bad
    > > encoding
    > > not matching UTF-8.

    >
    > I don't know what you mean. If Dir.[] tells you that the file name is
    > <e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
    > filename?
    >
    > I suggest you try something like this:
    >
    > puts "Source encoding: #{"".encoding}"
    > puts "External encoding: #{Encoding.default_external}"
    > Dir["*.lng"].each do |fn|
    > puts "Name: #{fn.inspect}"
    > puts "Encoding: #{fn.encoding}"
    > puts "Chars: #{fn.chars.to_a.inspect}"
    > puts "Codepoints: #{fn.codepoints.to_a.inspect}"
    > puts "Bytes: #{fn.bytes.to_a.inspect}"
    > puts
    > end
    >
    > then post the results for this file here. Then also post what you think
    > the true filename is.
    >
    > Then you can see whether: (1) Dir.[] is returning the correct sequence
    > of bytes for the filename or not; and (2) Dir.[] is tagging the string
    > with the correct encoding or not.
    >
    > (This is one of the thousands of cases I did *not* document in
    > string19.rb; I did some of the core methods on String, but of course
    > every method in every class which either returns a string or accepts a
    > string argument needs to document how it handles encodings)
    >
    > > as output of this script (which is then not able to rename any wrong
    > > file,
    > > because tr! seem to not work either on name) :
    > >
    > > path = ARGV[0] || "/"
    > >
    > > ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
    > >
    > > Dir["#{File.expand_path(path)}/**/*"].each { |f|
    > > name = File.basename(f)
    > > unless name =~ /^[#{ALLOWED_CHARS}]+$/
    > > puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/,
    > > ">\\1<")
    > >
    > > if name.tr!('éèê', 'e') =~ /^[#{ALLOWED_CHARS}]+$/ # Here it is
    > > not
    > > complete, it is just a test, but it doesn't work even for 'filéname'
    > > File.rename(f, File.dirname(f) + '/' + name)
    > > puts "\trenamed in #{name}"
    > > break
    > > end
    > > end
    > > }

    >
    > What error do you get? Is it failing to match the é at all (tr! returns
    > nil), or is an encoding error raised in tr!, or is an error raised by
    > File.rename ?
    > --
    > Posted via http://www.ruby-forum.com/.
    >
    >

    " And so it is. If memory serves, Mac OS X stores filenames in normal
    form D.

    > How to handle that easily?


    Normalize to normal form C instead.

    Best,
    --
    Marnen Laibow-Koser "

    So that solved it, converting with Iconv.
    The "UTF-8-MAC" encoding would probably only work on Mac, but it is meant
    for working with HFS+, so that's not really a problem.

    I found the documentation (in 1.9.2) of Iconv a little messy ...
    For example, typing 'ri Iconv#iconv' gives
    ------------------------------------------------------------ Iconv#iconv
    Iconv.iconv(to, from, *strs)

    and in 1.8.7
    ------------------------------------------------------------ Iconv#iconv
    iconv(str, start=0, length=-1)

    The result of ri (1.9.2) is the same as for 'ri Iconv::iconv', which is
    quite different.

    Anyway, converting every filename using this works :)

    fn = Iconv.open("UTF-8", "UTF-8-MAC") { |iconv|
      iconv.iconv(fn)
    }
    or
    fn = Iconv.iconv("UTF-8", "UTF-8-MAC", fn).shift
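
For readers on a Ruby that no longer ships Iconv, roughly the same conversion can be sketched with String#encode and Ruby's built-in UTF8-MAC encoding. This is an assumption-laden illustration rather than a drop-in replacement: the sample name is the one from this thread, and it presumes the running Ruby includes the UTF8-MAC transcoder.

```ruby
# Decomposed name as HFS+ hands it out (n followed by U+0303).
fn = "espan\u0303ol.lng"

# Treat the bytes as UTF8-MAC and transcode to plain (composed) UTF-8.
fixed = fn.encode("UTF-8", "UTF8-MAC")

puts fixed.codepoints.length
```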
    Benoit Daloze, Dec 29, 2009
    #14
  15. DJ Jazzy Linefeed

    Gary Watson Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    I would like to chime in here and point out that sometimes you really
    want to ignore the errors caused by mis-matched encodings, (as was the
    case in my script where I just wanted to match filenames ending in *.mpg
    and really didn't care if the characters occurring before had funkiness
    going on with them.)

    1.8 had this kind of behavior by default, and I'm assuming python3 and
    erlang do too based on the descriptions given in this thread.

    As Matz pointed out, you can force ruby1.9 to have this behavior simply
    by using the ASCII-8BIT (binary) encoding rather than the default encoding
    picked up from the locale (UTF-8 in this thread). This basically causes the
    regular expression engine to look at the string as a series of bytes again,
    like it used to, rather than freaking out when it sees a byte sequence it
    doesn't expect.
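
Another permissive option, sketched here under the assumption of a Ruby that has String#scrub (it was added well after this thread), is to replace the invalid bytes before matching instead of retagging the string as binary. The file name is made up:

```ruby
# Made-up file name with a Latin-1 byte, tagged UTF-8 as a UTF-8 locale would.
name = "caf\xE9.mpg".b.force_encoding("UTF-8")

# Replace each invalid byte sequence with "?" so the string becomes valid UTF-8.
clean = name.scrub("?")

is_mpg = !!(clean =~ /\.mpg$/)

puts clean
puts is_mpg
```

The trade-off is that the scrubbed name no longer matches the bytes on disk, so it suits filtering better than renaming.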

    I'm by no means knowledgeable about encodings, so take what I'm about to
    say with a grain of salt. It seems like the old way of handling
    encodings was permissive but imprecise, and the new way is precise but
    not always permissive. I like the ability to be precise because before
    that ability simply wasn't an option, however, since a lot of people
    seem to be confused by the default behavior why not make the default
    behavior permissive and set it up so that IF YOU WANT to be precise you
    can enable the proper encodings that ensure that behavior? To me this
    seems to fall in with the principle of least surprise. (Sorry for
    quoting it, I know it's over-quoted).

    What do people think?

    Regards
    Gary


    Edward Middleton wrote:
    > Bill Kelly wrote:
    >>> handling UTF8.
    >>>

    >>
    >> Could you summarize what you feel the key difference of
    >> the python 3 / erlang approach is, compared to ruby19 ?
    >>

    >
    > Taking a UTF-8 approach is easier to implement because you enforce all
    > strings to be UTF-8 and ignore when this doesn't work. Kind of like
    > saying everything will be ASCII or converted to it ;)
    >
    >> I'm a relative newbie in dealing with character encodings,
    >> but I do recall a few lengthy discussions on this list when
    >> ruby19's M17N was being developed, where the "UTF-8 only"
    >> approaches of some other languages were deemed insufficient
    >> for various reasons.
    >>

    >
    > Not everything maps one-to-one to UTF-8.
    >
    >> However, my understanding is that one is supposed to be
    >> able to effectively make ruby behave as a "UTF-8 only"
    >> language if one makes sure external data is transcoded to
    >> UTF-8 at I/O boundaries.
    >>

    >
    > That is pretty much it. The problem is that a lot of libraries still
    > don't handle encodings. This results in some spurious errors when a
    > function requiring compatible encoding operates on them[1]. The
    > solution is to add support for handling encodings.
    >
    > Edward
    >
    > 1. As opposed to ruby 1.8 which would silently ignore actual errors
    > caused by the use of incompatible encodings.
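
The "transcode at I/O boundaries" advice quoted above can be sketched as follows. This assumes the file is known to be ISO-8859-1; the helper `read_as_utf8` and the temp-file contents are hypothetical, chosen only to make the example self-contained.

```ruby
require "tempfile"

# Hypothetical helper: read a file in a known external encoding and
# hand the rest of the program UTF-8. The mode string "r:EXT:INT"
# asks Ruby to transcode from EXT to INT while reading.
def read_as_utf8(path, external: "ISO-8859-1")
  File.open(path, "r:#{external}:UTF-8") { |f| f.read }
end

Tempfile.create("latin1") do |tmp|
  tmp.binmode
  tmp.write("caf\xE9\n".b)   # Latin-1 bytes for "café\n"
  tmp.flush

  text = read_as_utf8(tmp.path)
  puts text.encoding          # UTF-8
  puts text.valid_encoding?   # true
end
```

Once every string entering the program has gone through a step like this, later `gsub` or concatenation calls only ever see valid UTF-8.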


    --
    Posted via http://www.ruby-forum.com/.
    Gary Watson, Dec 29, 2009
    #15
  16. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Gary Watson wrote:
    > I'm by no means knowledgeable about encodings, so take what I'm about to
    > say with a grain of salt. It seems like the old way of handling
    > encodings was permissive but imprecise, and the new way is precise but
    > not always permissive. I like the ability to be precise, because before
    > that ability simply wasn't an option. However, since a lot of people
    > seem to be confused by the default behavior, why not make the default
    > behavior permissive and set it up so that IF YOU WANT to be precise you
    > can enable the proper encodings that ensure that behavior? To me this
    > seems to fall in with the principle of least surprise. (Sorry for
    > quoting it, I know it's over-quoted).


    I guess the problem is that if you do this no libraries will make an
    effort to support encodings and you will lose all the advantages of
    proper encoding handling. I have to say, I cringed when the idea of
    handling encodings properly came up, because it is different from ruby
    1.8 and the transition was going to be difficult. Having said that, if
    you are going to support encodings this is probably the best way to do
    it, and in reality it is not that hard to get it right.

    Edward
    Edward Middleton, Dec 29, 2009
    #16
  17. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Bill Kelly wrote:
    >> IMO, both python 3 and erlang have got the right idea when it comes to
    >> handling UTF8.

    >
    > Could you summarize what you feel the key difference of
    > the python 3 / erlang approach is, compared to ruby19 ?


    As far as I can tell, both have two distinct data structures. One
    represents a binary object: a string of bytes. The other represents a
    textual string, a string of UTF-8 codepoints. (In the case of erlang,
    these are "binaries" and "lists" respectively).

    ruby 1.9 has one String which tries to do both jobs. I commonly deal
    with binary data: ASN1 encodings, PDFs, JPGs, firmware images, ZIP
    files, and so on. And yet ruby 1.9 has it now deeply embedded that all
    data is text (which is clearly not true: rather the converse, all text
    is data). At best you can get ruby 1.9 to tell you that your data is
    "ASCII-8BIT", even when it has nothing to do with ASCII whatsoever.

    I really miss having an object which simply represents a "sequence of
    bytes". Of course ruby 1.9 can do it, if you jump through the right
    hoops.
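
For illustration, here is roughly what those hoops look like when treating a String as a plain sequence of bytes. It assumes Ruby 2.0 or later for `String#b`; the constant `PNG_MAGIC` and the helper `png?` are names invented for this sketch.

```ruby
# The 8-byte PNG file signature, tagged as binary via String#b.
PNG_MAGIC = "\x89PNG\r\n\x1A\n".b

# Hypothetical helper: check a blob's leading bytes, whatever
# encoding tag the caller's string happens to carry.
def png?(data)
  data.b.start_with?(PNG_MAGIC)
end

blob = PNG_MAGIC + "\x00\x00\x00\rIHDR".b

puts blob.encoding          # still the binary (ASCII-8BIT) tag
puts blob.bytesize          # 16
puts blob.byteslice(1, 3)   # PNG
p blob.unpack("C4")         # [137, 80, 78, 71]
puts png?(blob)             # true
```

The byte-oriented methods (`bytesize`, `byteslice`, `bytes`, `unpack`) ignore the encoding tag, which is what makes them usable for binary formats.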

    I really miss being able to look at a simple expression such as
    a = b + c
    when I know that both b and c are String objects, and being able to say
    for definite whether or not it will raise an exception.
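
A small sketch of the case being described: whether `b + c` raises depends on both the encoding tags and the actual contents. The helper `concat_safe?` is hypothetical; `Encoding.compatible?` and `Encoding::CompatibilityError` are part of Ruby core.

```ruby
# Hypothetical helper: true when b + c would not raise.
# Encoding.compatible? returns the resulting encoding, or nil.
def concat_safe?(b, c)
  !Encoding.compatible?(b, c).nil?
end

text   = "caf\u00E9"     # UTF-8, contains a non-ASCII character
ascii  = "abc"           # UTF-8 tag, but ASCII-only content
binary = "\xFF\xFE".b    # binary tag, contains high bytes

puts concat_safe?(ascii, binary)   # true
puts concat_safe?(text, binary)    # false

begin
  text + binary
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end
```

So two strings with the same pair of encoding tags can concatenate fine in one run and raise in another, depending on whether either operand happens to be ASCII-only.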

    > However, my understanding is that one is supposed to be
    > able to effectively make ruby behave as a "UTF-8 only"
    > language if one makes sure external data is transcoded to
    > UTF-8 at I/O boundaries.


    If you jump through the right hoops, you can do this. If you omit any of
    the hoops, your program may work on some systems but not on others. ruby
    1.9's behaviour is environment-sensitive.
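
To make the environment sensitivity concrete: a string read without an explicit encoding is tagged according to `Encoding.default_external`, which is derived from the locale unless overridden. The sketch below assumes `Encoding.default_internal` is unset (the usual case); `read_utf8` is a made-up helper showing one way to stop depending on the environment.

```ruby
require "tempfile"

puts "default_external: #{Encoding.default_external}"
puts "default_internal: #{Encoding.default_internal.inspect}"

# Hypothetical helper: always tag file contents as UTF-8, regardless
# of the locale the program was started in.
def read_utf8(path)
  File.read(path, encoding: "UTF-8")
end

Tempfile.create("sample") do |tmp|
  tmp.write("plain ascii\n")
  tmp.flush

  puts File.read(tmp.path).encoding   # follows the environment
  puts read_utf8(tmp.path).encoding   # pinned by the caller
end
```

Running the same script under `LANG=C` and under a UTF-8 locale is a quick way to see the first `encoding` line change while the second stays put.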

    But the worst part of all this is that it's totally undocumented. Look
    into the 'ri' pages for most of ruby core, for any method which either
    takes a string, returns a string, or acts on a string, and you are
    unlikely to find any definition of its encoding-related behaviour,
    including under what circumstances it may raise an exception.

    By tagging every string with its own encoding, ruby 1.9 is solving a
    problem that does not exist: that is, how do you write a program which
    juggles multiple strings in different encodings all at the same time?

    And as the OP has discovered, the built-in support is often incomplete
    so that you have to use libraries like Iconv anyway.
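
For readers on later Rubies: Iconv was removed from the standard library in Ruby 2.0, and `String#encode` plus `String#scrub` (2.1+) cover the common cases. The sketch below assumes such a Ruby; `to_clean_utf8` is a hypothetical helper, not something from this thread.

```ruby
# Hypothetical helper: produce valid UTF-8 from bytes of uncertain
# origin. With `from`, transcode from that encoding; without it,
# assume UTF-8 and replace any invalid bytes with "?".
def to_clean_utf8(str, from: nil)
  if from
    str.encode("UTF-8", from, invalid: :replace, undef: :replace)
  else
    str.dup.force_encoding("UTF-8").scrub("?")
  end
end

latin1_bytes = "caf\xE9".b

puts to_clean_utf8(latin1_bytes, from: "ISO-8859-1")   # café
puts to_clean_utf8(latin1_bytes)                       # caf?
```

Knowing the real source encoding is always preferable; scrubbing only hides the damage.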
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Dec 29, 2009
    #17
  18. DJ Jazzy Linefeed

    Tony Arcieri Guest

    Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    On Tue, Dec 29, 2009 at 11:06 AM, Brian Candler wrote:

    > By tagging every string with its own encoding, ruby 1.9 is solving a
    > problem that does not exist: that is, how do you write a program which
    > juggles multiple strings in different encodings all at the same time?
    >


    To play devil's advocate here, Japanese users do routinely have to deal with
    multiple different encodings... Shift JIS on Windows/Mac, EUC-JP on *IX, and
    ISO-2022-JP for email (if I even got that correct, it's somewhat hard to
    keep track). And then on top of all of that there's Unicode in all its
    various forms...
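
As a rough illustration of the encodings named above, the following sketch transcodes one UTF-8 string into each of them and back using `String#encode`. The helper `roundtrip` is invented for this example, and it assumes the Ruby build ships the usual Japanese transcoders.

```ruby
# Hypothetical helper: encode `text` into `via`, then back to UTF-8.
# Returns the encoded byte size and whether the round trip was lossless.
def roundtrip(text, via)
  encoded = text.encode(via)
  [encoded.bytesize, encoded.encode("UTF-8") == text]
end

nihongo = "\u65E5\u672C\u8A9E"   # the word "nihongo" (Japanese) in kanji

%w[Shift_JIS EUC-JP ISO-2022-JP].each do |name|
  size, ok = roundtrip(nihongo, name)
  puts "#{name}: #{size} bytes, round-trips: #{ok}"
end
```

The same three characters take a different number of bytes in each encoding, which is why a per-string encoding tag is useful when several of them are in play at once.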

    While I would personally never choose M17n as the solution for my own
    language I can see why it makes sense for a language which originated in and
    is popular in Japan. The encoding situation over there is something of a
    mess.

    --
    Tony Arcieri
    Medioh! A Kudelski Brand
    Tony Arcieri, Dec 29, 2009
    #18
  19. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Tony Arcieri wrote:
    > To play devil's advocate here, Japanese users do routinely have to deal
    > with
    > multiple different encodings... Shift JIS on Windows/Mac, EUC-JP on *IX,
    > and
    > ISO-2022-JP for email


    Sure; and maybe they even want to process these formats without a
    round-trip to UTF8. (By the way, ruby 1.9 *can't* handle Shift JIS
    natively)

    I want a programming language which (a) handles strings of bytes, and
    (b) does so with simple, understandable, and predictable semantics: for
    example, concat string 1 with string 2 to make string 3. Is that too
    much to ask?

    Anyway, I'll shut up now.
    --
    Posted via http://www.ruby-forum.com/.
    Brian Candler, Dec 29, 2009
    #19
  20. Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it.

    Hi,

    I think you're being a bit pessimistic here :)

    Until my post on this subject I had never complained; far from it, I
    enjoyed playing with ∑, ∆ and so on.

    And I was not complaining, just asking how to solve that. (The fact that
    it didn't handle normalization form C is quite logical, I think; no
    language would do that easily.)
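
On the normalization point: Ruby 1.9, the version under discussion, had nothing built in for this, but later versions (2.2+) added `String#unicode_normalize`. A minimal sketch, assuming such a Ruby; `same_text?` is a hypothetical helper.

```ruby
# Hypothetical helper: compare two strings after NFC normalization,
# so precomposed and decomposed forms count as equal.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

decomposed = "e\u0301"    # "e" followed by a combining acute accent
precomposed = "\u00E9"    # single code point for "é"

puts decomposed == precomposed             # false
puts same_text?(decomposed, precomposed)   # true
puts decomposed.unicode_normalize(:nfc).length   # 1
```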

    I think having Unicode support is very useful. Look, for example (even
    if it is a bad one), at PHP with its mb_* functions and all its encoding
    functions. Scary, no? Well, I think the way it works at the moment is
    quite intuitive, and most of the time concatenation is not a problem at
    all.

    So, overall, I think good encoding support is really important, even if
    it is not needed every day.

    Regards,

    B.D.

    2009/12/29 Brian Candler wrote:

    > Tony Arcieri wrote:
    > > To play devil's advocate here, Japanese users do routinely have to deal
    > > with
    > > multiple different encodings... Shift JIS on Windows/Mac, EUC-JP on *IX,
    > > and
    > > ISO-2022-JP for email

    >
    > Sure; and maybe they even want to process these formats without a
    > round-trip to UTF8. (By the way, ruby 1.9 *can't* handle Shift JIS
    > natively)
    >
    > I want a programming language which (a) handles strings of bytes, and
    > (b) does so with simple, understandable, and predictable semantics: for
    > example, concat string 1 with string 2 to make string 3. Is that too
    > much to ask?
    >
    > Anyway, I'll shut up now.
    > --
    > Posted via http://www.ruby-forum.com/.
    >
    >
    Benoit Daloze, Dec 30, 2009
    #20