encoding problem with tr() and hash keys

Discussion in 'Ruby' started by Do One, Feb 21, 2009.

  1. Do One

    Do One Guest

    Please help me understand the solution to this problem (Ruby 1.9.1):

    In a UTF-8 environment I do:

    irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
    => {"a"=>1, "ā"=>2}
    irb(main):122:0> h.key? "a".tr("z", "\u0101")
    => false <--- wrong!
    irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
    => true

    So after I change a UTF-8 string that has no extended characters in it
    with tr(), where the second character set does contain extended
    characters, the new string is not found in the hash.

    Both strings are the same in Marshal encoding:

    irb(main):124:0> Marshal.dump "a".tr("\u0101", "\u0101")
    => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
    irb(main):126:0> Marshal.dump "a"
    => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"


    The question is: how should I write code using tr() so that the new
    string will be found in the hash?

    And I think this is a bug in Ruby, because it is completely unexpected
    behavior.
    --
    Posted via http://www.ruby-forum.com/.
     
    Do One, Feb 21, 2009
    #1

  2. Do One

    7stud -- Guest

    Do One wrote:
    > Please help me understand the solution to this problem (Ruby 1.9.1):
    >
    > In a UTF-8 environment I do:
    >
    > irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
    > => {"a"=>1, "ā"=>2}
    > irb(main):122:0> h.key? "a".tr("z", "\u0101")
    > => false <--- wrong!
    >



    h = {"a" => 1, "b" => 2}

    p "a".tr("z", "\u0101") #"a"

    puts h.key?("a".tr("z", "x")) #true

    ruby 1.8.2

    7stud --, Feb 21, 2009
    #2

  3. Do One

    7stud -- Guest

    7stud -- wrote:
    > h = {"a" => 1, "b" => 2}
    >
    > p "a".tr("z", "\u0101") #"a"
    >
    > puts h.key?("a".tr("z", "x")) #true
    >
    > ruby 1.8.2


    Whoops. Make that:


    h = {"a" => 1, "\u0101" => 2}

    p "a".tr("z", "\u0101") #=>"a"

    puts h.key?("a".tr("z", "\u0101")) #=>true

    7stud --, Feb 21, 2009
    #3
  4. Do One

    Do One Guest

    Re: encoding problem with tr() and hash keys (1.9.1)

    The problem described occurs under modern Ruby 1.9.1 in a UTF-8 environment.

    7stud -- wrote:
    >> ruby 1.8.2

    >
    > Whoops. Make that:


    ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-linux]
    irb(main):001:0> h = {"a" => 1, "\u0101" => 2}
    => {"a"=>1, "u0101"=>2}

    See? It doesn't even understand the Unicode escape sequence \uXXXX.


    Do One wrote:
    > Please help me understand the solution to this problem (Ruby 1.9.1):
    >
    > In a UTF-8 environment I do:
    >
    > irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
    > => {"a"=>1, "ā"=>2}
    > irb(main):122:0> h.key? "a".tr("z", "\u0101")
    > => false <--- wrong!
    > irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
    > => true
    >
    > So after I change a UTF-8 string that has no extended characters in it
    > with tr(), where the second character set does contain extended
    > characters, the new string is not found in the hash.
    >
    > Both strings are the same in Marshal encoding:
    >
    > irb(main):124:0> Marshal.dump "a".tr("\u0101", "\u0101")
    > => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
    > irb(main):126:0> Marshal.dump "a"
    > => "\x04\bI\"\x06a\x06:\rencoding\"\nUTF-8"
    >
    >
    > The question is: how should I write code using tr() so that the new
    > string will be found in the hash?
    >
    > And I think this is a bug in Ruby, because it is completely unexpected
    > behavior.


    Do One, Feb 22, 2009
    #4
  5. Do One wrote:
    > Please help me understand the solution to this problem (Ruby 1.9.1):
    >
    > In a UTF-8 environment I do:
    >
    > irb(main):121:0> h = {"a" => 1, "\u0101" => 2}
    > => {"a"=>1, "ā"=>2}
    > irb(main):122:0> h.key? "a".tr("z", "\u0101")
    > => false <--- wrong!
    > irb(main):123:0> h.key? "\u0101".tr("z", "\u0101")
    > => true


    Perhaps describe your environment in more detail? It works for me:

    $ irb19
    irb(main):001:0> h = {"a" => 1, "\u0101" => 2}
    => {"a"=>1, "ā"=>2}
    irb(main):002:0> h.key?("a")
    => true
    irb(main):003:0> h.key?("\u0101")
    => true
    irb(main):004:0> h.key?("a".tr("z", "\u0101"))
    => true
    irb(main):005:0> h.key? "a".tr("z", "\u0101")
    => true
    irb(main):006:0> h.key? "z".tr("z", "\u0101")
    => true
    irb(main):007:0>

    This is Ubuntu Hardy, ruby 1.9.1 (2008-12-01 revision 20438)
    [i686-linux] compiled from source. I think this is 1.9.1-preview2 rather
    than 1.9.1-p0.

    To eliminate problems with encoding, maybe try writing this as a script
    and running it from the command line:

    p h = {"a" => 1, "\u0101" => 2}
    p h.key?("a")
    p h.key?("\u0101")
    p h.key?("a".tr("z", "\u0101"))
    p h.key? "a".tr("z", "\u0101")
    p h.key? "z".tr("z", "\u0101")

    ruby19 test.rb
    ruby19 -Ku test.rb
    ruby19 --encoding UTF-8:UTF-8 test.rb

    to see if this makes any difference. On my machine at least, the -K and
    --encoding flags are not recognised by irb.
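
    When comparing results between machines, it can help to print the
    encoding-related settings alongside the test output. The following is
    only a sketch: encoding_report is a made-up helper name, and the set of
    values it collects is a guess at what is relevant here.

```ruby
# Collect the encoding-related settings that can differ between machines.
# (encoding_report is a hypothetical helper, not part of Ruby.)
def encoding_report
  {
    "ruby"             => RUBY_VERSION,
    "source"           => __ENCODING__,              # encoding of this source file
    "default_external" => Encoding.default_external,
    "default_internal" => Encoding.default_internal, # nil unless explicitly set
    "locale_charmap"   => Encoding.locale_charmap,
    "LC_ALL"           => ENV["LC_ALL"],
    "LC_CTYPE"         => ENV["LC_CTYPE"],
    "LANG"             => ENV["LANG"],
  }
end

encoding_report.each { |name, value| puts "#{name}: #{value.inspect}" }
```

    Running this once in irb and once as a script should show whether the
    two environments actually agree.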
    Brian Candler, Feb 23, 2009
    #5
  6. Brian Candler, Feb 23, 2009
    #6
  7. Do One

    Tom Link Guest

    Tom Link, Feb 23, 2009
    #7
  8. Do One

    Do One Guest

    Yes, it was fixed yesterday with two consecutive patches. The first one
    did not fix it completely, but before I found how to reproduce the bug
    it got fixed a second time. (ruby 1.9.2 svn trunk)

    > Perhaps describe your environment in more detail? It works for me:


    How to reproduce the bug (to understand its traps):

    1. utf-8 env:

    $ ruby -v
    ruby 1.9.1p0 (2009-01-30 revision 21907) [i686-linux]
    $ export LC_CTYPE=en_US.utf-8
    $ irb
    irb(main):001:0> {"a" => 1}.key? "a".tr("z", "\u0101")
    => false

    Reproduced. Without utf-8 env you just don't see it:

    $ export LC_CTYPE=en_US
    $ irb
    irb(main):001:0> {"a" => 1}.key? "a".tr("z", "\u0101")
    => true

    2. Even if your environment is not UTF-8, the bug is still there if your
    script has the "encoding: utf-8" magic comment:

    $ cat a.rb
    #encoding: utf-8
    p ({"a" => 1}).key?("a".tr("z", "\u0101"))
    $ ruby a.rb
    false

    3. Or if you are using the -KU switch:

    $ ruby -KU -e 'p ({"a" => 1}).key?("a".tr("z", "\u0101"))'
    false


    I got stuck on this while parsing word lists in which some words have
    diacritical marks. Some words were processed differently than others
    even though the code was correct, and it was just plain crazy.
    Do One, Feb 24, 2009
    #8
  9. Do One wrote:
    > Yes, it was fixed yesterday with two consecutive patches. The first one
    > did not fix it completely, but before I found how to reproduce the bug
    > it got fixed a second time. (ruby 1.9.2 svn trunk)


    It looks like this craziness is core behaviour for ruby 1.9,
    unfortunately.

    Notice that in your script which reproduces the problem, the encodings
    of the two strings match. Results shown are for ruby 1.9.1p0 (2009-01-30
    revision 21907) [i686-linux]

    #encoding: utf-8
    a = "a"
    b = a.tr("z", "\u0101")
    h = {a => 1}
    p h.key?(a) #true
    p h.key?(b) #false !!

    p a #"a"
    p b #"a"
    p a.encoding #<Encoding:UTF-8>
    p b.encoding #<Encoding:UTF-8>

    p a == b #true
    p a.hash #137519702
    p b.hash #137519703 AHA!

    So two strings, with identical byte sequences and identical encodings,
    calculate different hashes. So there must be some hidden internal state
    in the string which affects the calculation of the hash. I presume this
    is the flag ENC_CODERANGE_7BIT.

    It's hard to test whether this flag has been set correctly, if
    String#encoding doesn't show it, so you have to use indirect methods
    like String#hash.
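
    The indirect checks described above can be bundled into one helper. This
    is a sketch only: key_mismatches is a made-up name, and the use of
    String#ascii_only? as a possible window onto the hidden flag is an
    assumption that has not been verified on 1.9.1p0.

```ruby
# encoding: utf-8

# Returns a list of violated expectations for two strings that are
# supposed to be interchangeable as Hash keys (empty list = all good).
# (key_mismatches is a hypothetical helper, not part of Ruby.)
def key_mismatches(a, b)
  problems = []
  problems << "a == b is false"     unless a == b
  problems << "a.eql?(b) is false"  unless a.eql?(b)
  problems << "encodings differ"    unless a.encoding == b.encoding
  problems << "ascii_only? differs" unless a.ascii_only? == b.ascii_only?
  problems << "hash values differ"  unless a.hash == b.hash
  problems << "Hash lookup fails"   unless { a => 1 }.key?(b)
  problems
end

a = "a"
b = a.tr("z", "\u0101")
p key_mismatches(a, b)
```

    On an interpreter without the bug the list should come back empty; on an
    affected one, the entries show which of the expectations broke.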

    But now I think I understand the problem, it's easy to find more
    examples of the same brokenness. Here's one:

    #encoding: utf-8
    a = "a"
    b = "aß"
    b = b.delete("ß")
    h = {a => 1}
    p h.key?(a) #true
    p h.key?(b) #false !!

    p a #"a"
    p b #"a"
    p a.encoding #<Encoding:UTF-8>
    p b.encoding #<Encoding:UTF-8>

    p a == b #true
    p a.hash #-590825394
    p b.hash #-590825393


    I wonder just how many other string methods are broken in this way? And
    how many extension writers are going to set this hidden flag correctly
    in their strings, if even the ruby core developers don't always do it?

    It looks like this flag is a bad optimisation.

    * It needs recalculating every time a string is modified (thus negating
    the benefits of the optimisation)

    * It introduces hidden state, which affects behaviour but cannot be
    directly tested

    * If the state is not set correctly *every* time a string is generated
    or modified - and this includes in all extension modules - then things
    break.
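
    If a stale internal flag really is the cause, one possible workaround
    for the original question is to hand Ruby a fresh copy of the string and
    make it rescan the bytes before using it as a key. The sketch below
    assumes that String#force_encoding discards the cached state; that
    assumption and the helper name rescanned are not from this thread, and
    it has not been tested against 1.9.1p0.

```ruby
# encoding: utf-8

# Hypothetical helper: returns a copy of +str+ with the same bytes and the
# same encoding. The assumption (unverified on 1.9.1p0) is that
# force_encoding drops any cached "is this 7-bit?" state on the copy.
def rescanned(str)
  str.dup.force_encoding(str.encoding)
end

h = { "a" => 1, "\u0101" => 2 }
key = rescanned("a".tr("z", "\u0101"))
p h.key?(key)
```

    The copy costs an allocation per lookup, so this only makes sense as a
    stopgap around the specific calls that misbehave.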

    Regards,

    Brian.
    Brian Candler, Feb 24, 2009
    #9
  10. Do One

    Do One Guest

    Brian Candler wrote:
    > But now I think I understand the problem, it's easy to find more
    > examples of the same brokenness. Here's one:
    >
    > #encoding: utf-8
    > a = "a"
    > b = "aß"
    > b = b.delete("ß")
    > h = {a => 1}
    > p h.key?(a) #true
    > p h.key?(b) #false !!


    This is still false even in "fixed" 1.9.2dev. Probably you should report
    it. :)


    > I wonder just how many other string methods are broken in this way? And
    > how many extension writers are going to set this hidden flag correctly
    > in their strings, if even the ruby core developers don't always do it?


    Scary.
    Do One, Feb 24, 2009
    #10
  11. Do One wrote:
    > Brian Candler wrote:
    >> But now I think I understand the problem, it's easy to find more
    >> examples of the same brokenness. Here's one:
    >>
    >> #encoding: utf-8
    >> a = "a"
    >> b = "aß"
    >> b = b.delete("ß")
    >> h = {a => 1}
    >> p h.key?(a) #true
    >> p h.key?(b) #false !!

    >
    > This is still false even in "fixed" 1.9.2dev. Probably you should report
    > it. :)


    It's Not My Problem[TM], because I don't use 1.9 and have no intention
    of doing so for the foreseeable future. The semantics of Strings are now
    so complex that they are not even documented (except as
    reverse-engineered by some third parties for commercial books) - so how
    can you complain when they do something you don't expect?

    Ruby <=1.8.6 is an old friend. But for me, Ruby >=1.9 is more like a
    Rottweiler. I'm sure Rottweilers can make great companions to the right
    sort of owners.

    B.
    Brian Candler, Feb 24, 2009
    #11
  12. Do One

    Do One Guest

    Brian Candler wrote:
    > Do One wrote:
    >> This is still false even in "fixed" 1.9.2dev. Probably you should report
    >> it. :)

    >
    > It's Not My Problem[TM], because I don't use 1.9 and have no intention
    > of doing so for the foreseeable future. The semantics of Strings are now
    > so complex that they are not even documented (except as
    > reverse-engineered by some third parties for commercial books) - so how
    > can you complain when they do something you don't expect?
    >
    > Ruby <=1.8.6 is an old friend. But for me, Ruby >=1.9 is more like a
    > Rottweiler. I'm sure Rottweilers can make great companions to the right
    > sort of owners.


    Ok, let's not report it and see how long it will stay. :)
    Do One, Feb 24, 2009
    #12
  13. Do One wrote:
    > Ok, let's not report it and see how long it will stay. :)


    Go ahead, but it doesn't fix the underlying problem. Do you want to test
    *every* method which returns a String? Do you want to do this for all
    third-party C extensions?
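
    Testing every method is not practical, but a small sweep over the
    obvious suspects is cheap. The sketch below is illustrative only: the
    list of methods is an arbitrary sample, and CANDIDATES and
    broken_methods are made-up names.

```ruby
# encoding: utf-8

# Each lambda starts from some string and should end up with plain "a".
# The list is a sample, not an exhaustive survey of String methods.
CANDIDATES = {
  "tr"      => lambda { "a".tr("z", "\u0101") },
  "delete"  => lambda { "a\u00df".delete("\u00df") },
  "sub"     => lambda { "a\u00df".sub("\u00df", "") },
  "gsub"    => lambda { "a\u00df".gsub("\u00df", "") },
  "chop"    => lambda { "a\u00df".chop },
  "slice"   => lambda { "a\u00df"[0, 1] },
  "squeeze" => lambda { "a".squeeze("\u0101") },
}

# Returns the names of the candidates whose result cannot stand in for
# the Hash key +expected+ (hypothetical helper).
def broken_methods(candidates, expected = "a")
  candidates.reject { |_name, make|
    s = make.call
    s == expected && s.hash == expected.hash && { expected => 1 }.key?(s)
  }.keys
end

p broken_methods(CANDIDATES)
```

    An empty list means none of the sampled methods produced a string that
    fails as a key on the interpreter running the script.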
    Brian Candler, Feb 24, 2009
    #13
  14. On Feb 24, 2009, at 2:36 PM, Brian Candler wrote:

    > Do One wrote:
    >> Ok, let's not report it and see how long it will stay. :)

    >
    > Go ahead, but it doesn't fix the underlying problem. Do you want to
    > test
    > *every* method which returns a String? Do you want to do this for all
    > third-party C extensions?
    > --
    > Posted via http://www.ruby-forum.com/.
    >



    Then report the underlying problem.

    Regards,
    Florian

    --
    Florian Gilcher

    smtp:
    jabber:
    gpg: 533148E2
     
    Florian Gilcher, Feb 24, 2009
    #14
  15. Florian Gilcher wrote:
    > Then report the underlying problem.


    IMO, the underlying problems are that ruby 1.9's concept of Strings is
    (a) not properly thought-out and (b) totally undocumented. I don't think
    a ticket on redmine would be appreciated on either count.

    The idea that a string should carry along its encoding sounds great in
    principle, but dozens of questions come out from that: even simple ones
    like "what happens if I concatenate two strings with different
    encodings?" are not answered without experimentation. And then you start
    to uncover the rules about "compatible" encodings, automatic switching
    between some character sets and US-ASCII, which happens some times but
    not others, or is hidden away:

    #encoding: UTF-8
    a = "a"
    p a.encoding #<Encoding:UTF-8>
    p "#{a}".encoding #<Encoding:UTF-8>
    p /#{a}/.encoding #<Encoding:US-ASCII>

    -- Ruby could have decided up-front that this string was US-ASCII, but
    didn't. Except that it magically carries along some hidden knowledge
    that this particular string, whilst declared to be UTF-8, is also
    'compatible' with US-ASCII. Presumably, if you mutate it to include
    special characters, this compatibility is lost, and if you mutate it
    again to remove them, it is restored.
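
    The "compatible" rule can at least be probed directly with
    Encoding.compatible?, which returns the encoding a combination would
    have, or nil. A minimal sketch; the sample strings and the try_concat
    helper are made up for illustration.

```ruby
# encoding: utf-8

utf8   = "caf\u00e9"                                  # UTF-8 with one non-ASCII character
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")   # same word, different encoding
ascii  = "abc".dup.force_encoding("US-ASCII")         # 7-bit characters only

# Hypothetical helper: the encoding of x + y, or the exception class
# if Ruby refuses to concatenate them.
def try_concat(x, y)
  (x + y).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

p Encoding.compatible?(utf8, ascii)   # UTF-8: a 7-bit string fits into it
p Encoding.compatible?(utf8, latin1)  # nil: both sides carry non-ASCII bytes
p try_concat(utf8, ascii)             # UTF-8
p try_concat(utf8, latin1)            # Encoding::CompatibilityError
```

    So whether a concatenation works depends on the contents of the strings
    as well as on their declared encodings.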

    But it's worse than that. With 1.8, someone could post a Ruby script on
    this mailing list, and I'd have high confidence that it would work
    exactly the same if I ran it on my machine against the same data. In
    1.9, all sorts of factors inherited from your environment may make the
    program either behave differently, or indeed crash, on one machine but
    not the other. Plenty of examples have been posted on this list. It
    makes test coverage very hard to achieve. You need to defend against it.
    In other words: it bites.
    Brian Candler, Feb 24, 2009
    #15
  16. Do One

    James Gray Guest

    On Feb 24, 2009, at 8:47 AM, Brian Candler wrote:

    > IMO, the underlying problems are that ruby 1.9's concept of Strings is
    > (a) not properly thought-out…


    This seems pretty silly. It took years to develop. You really
    believe they didn't consider what they were doing in that time?

    James Edward Gray II
     
    James Gray, Feb 24, 2009
    #16
  17. James Gray wrote:
    > On Feb 24, 2009, at 8:47 AM, Brian Candler wrote:
    >
    >> IMO, the underlying problems are that ruby 1.9's concept of Strings is
    >> (a) not properly thought-out…

    >
    > This seems pretty silly. It took years to develop. You really
    > believe they didn't consider what they were doing in that time?


    Possibly they were so wrapped up in it that they didn't step back and
    look at the end product.

    ri19 describes a String like this:

    A +String+ object holds and manipulates an arbitrary sequence of
    bytes, typically representing characters. String objects may be
    created using +String::new+ or as literals.

    That makes sense, and actually this description is unchanged from 1.8.

    When you manipulate them, they don't *behave* at all like sequences of
    bytes, and this is intentional: they are intended to behave like
    sequences of characters. But in an attempt to DTRT in all situations,
    they behave in strange and unpredictable ways. By that I mean: *I* can't
    predict how simple expressions like "#{a}" or /#{a}/ will work (in terms
    of encodings) without actually trying them. And if I want to read binary
    data from STDIN, I have to jump through hoops to ensure it's not tainted
    with the wrong encoding.
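
    For the binary-input case, the usual hoop is IO#binmode, which makes
    reads return strings tagged as ASCII-8BIT. A sketch, using a pipe as a
    stand-in for STDIN so that it runs on its own; the three sample bytes
    are arbitrary.

```ruby
# Stand-in for STDIN so the example is self-contained: a pipe carrying
# three raw bytes, one of which (0xFF) never occurs in valid UTF-8.
reader, writer = IO.pipe
writer.binmode
writer.write([0xFF, 0x00, 0x61].pack("C*"))
writer.close

reader.binmode            # for real input this would be $stdin.binmode
data = reader.read
reader.close

p data.encoding           # the binary (ASCII-8BIT) encoding
p data.bytes.to_a         # [255, 0, 97]
```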
    Brian Candler, Feb 24, 2009
    #17