utf-8 & Range under eruby (possibly Rails) problems

J

Johan Sörensen

Hi,

I'm having some issues with a range that truncates texts, the below is
a (very) simplified version of the truncate method thats used in rails
(which is where I discovered this):

# this in an utf-8 encoded erb template (a rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
och hackar cocoa" -%>
<%= othertext[0..47] %>

#produces this (the last character on the first line will display as
a "funny character" in browsers)

Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på


Is this a possible bug in Ruby (1.8.1) or could it be something with
Rails that gets in the way, I can reproduce this across two servers
and in webrick.
I was unable to do this properly in irb, since my terminal (or irb)
would act funny on the öäå's..

--johan
 
C

Carlos

# this in an utf-8 encoded erb template (a rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
och hackar cocoa" -%>
<%= othertext[0..47] %>

#produces this (the last character on the first line will display as
a "funny character" in browsers)

Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på


Is this a possible bug in Ruby (1.8.1) or could it be something with
Rails that gets in the way, I can reproduce this across two servers
and in webrick.

It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).

Here is something you can start from:

module UTF8Str
def [] (*params)
if params.all? { |p| Integer===p } ||
params.size==1 && Range===params[0]
res = self.unpack("U*").[](*params)
res = [res] unless Array===res
return res.pack("U*")
end
super
end
end

a="áéióúü"
a.extend UTF8Str

puts a[0], a[1], a[2], a[3], a[4], a[1,2], a[1..2], a[-1]


Good luck.

--
 
J

Johan Sörensen

It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).

I see.

The thing that has me confused though, is that it's not consistant
since it'll only happen on the first line in the example I gave.
I expand the range a little and it'll pass through untouched. I change
either off the preceeding ö's it'll pass through untouched.

Is this expected behaviour?

-- johan
 
C

Carlos

It is a Ruby feature :). Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).

I see.

The thing that has me confused though, is that it's not consistant
since it'll only happen on the first line in the example I gave.
I expand the range a little and it'll pass through untouched. I change
either off the preceeding ö's it'll pass through untouched.

Well, because "ö".length == 2 (UTF-8 is a multibyte encoding). Your range's
end was falling between the two bytes of the "ö".

--
 
M

Michael DeHaan

Someone on PerlMonks taught me a neat trick. A regex split about
nothing returns an array of one-character strings. It's true for Ruby
as well ... So these indexing routines are really simple.

some_string.split(//).each { |c|
...
}

# or ... some_string.split(//)[5]



Carlos> It is a Ruby feature :). Indices in strings are bytes, not
chars. For the
Carlos> moment, you must develop your own indexing routines for UTF-8 strings
Carlos> (notice that String#[/regex/] works, because regexes are UTF-8 aware).
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top