utf-8 & Range under eruby (possibly Rails) problems

Johan Sörensen · Dec 17, 2004

Hi,

I'm having some issues with a range that truncates text. Below is
a (very) simplified version of the truncate method that's used in Rails
(which is where I discovered this):

# this is in a UTF-8 encoded erb template (a Rails "view" in my case)
<% text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och
hackar cocoa" -%>
<%= text[0..47] %>
<br />
<%= text[0..48] %>
<br />
# notice the 'o' in ingenjor instead of 'ö'
<% othertext = "Eftersom jag jobbar som kontruktör/ingenjor på dagarna
och hackar cocoa" -%>
<%= othertext[0..47] %>

#produces this (the last character on the first line will display as
a "funny character" in browsers)

Eftersom jag jobbar som kontruktör/ingenjör p?
Eftersom jag jobbar som kontruktör/ingenjör på
Eftersom jag jobbar som kontruktör/ingenjor på

Is this a possible bug in Ruby (1.8.1), or could it be something in
Rails that gets in the way? I can reproduce this across two servers
and in WEBrick.
I was unable to test this properly in irb, since my terminal (or irb)
would act funny on the öäå's.

--johan

Carlos · Dec 17, 2004

Johan> Is this a possible bug in Ruby (1.8.1) or could it be something with
Johan> Rails that gets in the way, I can reproduce this across two servers
Johan> and in webrick.

It is a Ruby feature. Indices in strings are bytes, not chars. For the
moment, you must develop your own indexing routines for UTF-8 strings
(notice that String#[/regex/] works, because regexes are UTF-8 aware).
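
The regex route mentioned above can be sketched roughly like this. It is only an illustration: the helper name first_chars is made up here, and the /u flag is assumed to be what puts the regex in UTF-8 mode (on Ruby 1.8, setting $KCODE = 'u' should have the same effect).

```ruby
# encoding: utf-8

# Hypothetical helper (not from the thread): return at most `count`
# leading characters of a UTF-8 string by letting a regex do the counting.
# /u asks for UTF-8 matching, /m lets "." also match newlines.
def first_chars(text, count)
  text[/\A.{0,#{count}}/um]
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"
puts first_chars(text, 46)   # the cut lands after a complete "å"
```

Because the regex consumes whole characters, the result can never end in the middle of a multibyte sequence.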

Here is something you can start from:

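module UTF8Str
  # Index by UTF-8 character instead of by byte when given
  # integers or a single range; anything else goes to String#[].
  def [](*params)
    if params.all? { |p| Integer === p } ||
       (params.size == 1 && Range === params[0])
      res = unpack("U*")[*params]       # code points, indexed like an Array
      return nil if res.nil?            # out of range, same as String#[]
      res = [res] unless Array === res  # a single index yields one code point
      return res.pack("U*")
    end
    super
  end
end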

a="áéióúü"
a.extend UTF8Str

puts a[0], a[1], a[2], a[3], a[4], a[1,2], a[1..2], a[-1]
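
Applied to the truncation problem from the original post, the same unpack("U*") idea might look like the sketch below. The method name truncate_utf8 and the "..." suffix are assumptions made for illustration; this is not the Rails truncate implementation.

```ruby
# encoding: utf-8

# Hypothetical helper: truncate by code points instead of bytes,
# so the cut can never land inside a multibyte character.
def truncate_utf8(text, length, suffix = "...")
  codepoints = text.unpack("U*")            # one Integer per character
  return text if codepoints.size <= length  # short enough, leave untouched
  codepoints[0, length].pack("U*") + suffix
end

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"
puts truncate_utf8(text, 46)
```

Note that this counts code points, so a letter written as a base character plus a combining accent would still count as two.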

Good luck.

--

Johan Sörensen · Dec 17, 2004

Carlos> It is a Ruby feature. Indices in strings are bytes, not chars. For the
Carlos> moment, you must develop your own indexing routines for UTF-8 strings
Carlos> (notice that String#[/regex/] works, because regexes are UTF-8 aware).

I see.

The thing that has me confused, though, is that it's not consistent,
since it only happens on the first line in the example I gave.
If I expand the range a little, it passes through untouched. If I change
either of the preceding ö's, it passes through untouched.

Is this expected behaviour?

-- johan

Carlos · Dec 17, 2004

Johan> I see.
Johan>
Johan> The thing that has me confused, though, is that it's not consistent,
Johan> since it only happens on the first line in the example I gave.
Johan> If I expand the range a little, it passes through untouched. If I change
Johan> either of the preceding ö's, it passes through untouched.

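Well, because "ö".length == 2 (UTF-8 is a multibyte encoding, and Ruby counts
bytes). Your range's end was falling between the two bytes of the "å" in "på":
it starts at byte 47, so text[0..47] keeps only its first byte. Replace an "ö"
with "o" and everything after it moves one byte to the left, so the same range
ends on a complete character.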
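
Counting bytes in the example string shows where the cut lands. This is a small check written for Ruby 1.9 or later, where String#[] already counts characters; byteslice is used here as a stand-in for the byte-based text[0..47] of Ruby 1.8.

```ruby
# encoding: utf-8

text = "Eftersom jag jobbar som kontruktör/ingenjör på dagarna och hackar cocoa"

# Byte offset at which "å" starts: the byte size of everything before it.
offset = text[0, text.index("å")].bytesize

cut_47 = text.byteslice(0, 48)   # bytes 0..47, like text[0..47] on Ruby 1.8
cut_48 = text.byteslice(0, 49)   # bytes 0..48, like text[0..48] on Ruby 1.8

puts "ö".bytesize                # 2
puts offset                      # 47
puts cut_47.valid_encoding?      # false: ends on half of "å"
puts cut_48.valid_encoding?      # true: "å" is complete
```

With the second "ö" replaced by "o", the "å" starts at byte 46 instead, which is why the same 0..47 range then comes out intact.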

--

Michael DeHaan · Dec 17, 2004

Someone on PerlMonks taught me a neat trick: splitting on an empty regex
returns an array of one-character strings. That's true for Ruby
as well, so these indexing routines are really simple.

some_string.split(//).each { |c|
...
}

# or ... some_string.split(//)[5]
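
A slightly fuller sketch of the split trick follows. The helper name utf8_chars is hypothetical, and the /u flag is an addition here: on Ruby 1.8 the empty regex is assumed to split into bytes unless it is in UTF-8 mode (via /u or $KCODE = 'u').

```ruby
# encoding: utf-8

# Hypothetical helper: split a UTF-8 string into one-character strings.
def utf8_chars(text)
  text.split(//u)
end

chars = utf8_chars("áéióúü")
puts chars[5]            # the sixth character
puts chars[1, 2].join    # two characters starting at index 1
```

The array is rebuilt on every call, so for repeated indexing into a long string it is probably worth keeping the result around.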

Carlos> It is a Ruby feature. Indices in strings are bytes, not chars. For the
Carlos> moment, you must develop your own indexing routines for UTF-8 strings
Carlos> (notice that String#[/regex/] works, because regexes are UTF-8 aware).
