Ruby 1.8 - character encoding

  • Thread starter Thomas Thomassen

Thomas Thomassen

I write Ruby plugins for Google Sketchup.

Sketchup uses UTF-8 strings and passes these to Ruby (1.8), which
handles Strings as a simple series of bytes. This caused problems when I
tried to pass a String I got from Sketchup which contained a file path
with some Norwegian letters (æøåÆØÅ), as Ruby then raised an error
saying the file/path didn't exist.

This was because æøåÆØÅ lie outside the ASCII character set, so they were
returned as two-byte sequences in UTF-8.


Searching the net I found some hacks that converted UTF-8 into single
byte characters: str_utf8.unpack('U*').pack('C*')

The Norwegian characters lie outside the ASCII range, but they still
get packed into single-byte characters that the File classes can
handle.


Example:
'æøåÆØÅ'.length # <- all these characters cause the File class to fail

'æøåÆØÅ'.unpack('U*').pack('C*').length # <- File class can now handle this
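To make the byte counts concrete, here is a minimal sketch of the same conversion. It assumes the string really arrives as UTF-8; the bytes are spelled out by hand so the result does not depend on how the script file is saved, and lengths are counted with unpack('C*') so the numbers mean bytes on any Ruby version.

```ruby
# UTF-8 bytes for the six letters ae, o-slash, a-ring and their capitals,
# written out explicitly (two bytes per letter).
utf8 = [0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5,
        0xC3, 0x86, 0xC3, 0x98, 0xC3, 0x85].pack('C*')

# 'U*' decodes the UTF-8 bytes into code points;
# 'C*' then writes each code point back out as one single byte.
codepoints = utf8.unpack('U*')
latin1     = codepoints.pack('C*')

utf8_bytes   = utf8.unpack('C*').length     # two bytes per letter
latin1_bytes = latin1.unpack('C*').length   # one byte per letter

puts codepoints.inspect
puts "#{utf8_bytes} bytes as UTF-8, #{latin1_bytes} bytes after the hack"
```

The code points printed are the same decimal values (230, 248, 229, ...) that the single-byte character tables list for these letters.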

So it seems that the File class doesn't just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?

My tests have been on a Norwegian Windows XP system with Norwegian
locale. Default language for applications that don't support Unicode
is also set to Norwegian.


To sum up: what I'm trying to work out is how UTF-8 characters above
the ASCII range (0-127) are mapped to the 128-255 range. Does the 128-255
range refer to ANSI (1252) or ISO-8859-1? <- and is this due to system
settings?
 
Thomas Thomassen

Looking at http://en.wikipedia.org/wiki/ISO-8859-1
There's an extra character code besides the code that equals the ANSI
code. It consists of 3 digits ranging from 0-7. This code can be used
in Ruby in conjunction with the escape character:

    ANSI  ISO-8859-1
--------------------
æ - 230 - 230 / 346
ø - 248 - 248 / 370
å - 229 - 229 / 345
Æ - 198 - 198 / 306
Ø - 216 - 216 / 330
Å - 197 - 197 / 305

"\306".length # Code for Æ

Since this code doesn't exist in ANSI it seems to me that Ruby interprets
ISO-8859 encoding. But I'm still wondering if this is system
dependent...
 
Thomas Thomassen

Sorry, I just realised that the extra number was the octal variant.
So it could still be ANSI...
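
A small sketch of that point: the three-digit number is the same byte value written in octal, so the escape says nothing about which 8-bit character set is in use. The variable names are only for illustration.

```ruby
# "\306" is an octal escape: 3*64 + 0*8 + 6 = 198, i.e. one byte.
octal_byte   = "\306".unpack('C*')
# The same byte written as a hex escape and built from the decimal value.
hex_byte     = "\xC6".unpack('C*')
decimal_byte = [198].pack('C').unpack('C*')

same = (octal_byte == hex_byte) && (hex_byte == decimal_byte)
puts octal_byte.inspect
puts same
```

Whether byte 198 is displayed as Æ depends on whoever interprets the byte; Ruby 1.8 itself just stores it.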
 
Gregory Brown

I have seen that series. I still can't work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.

I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?

Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.

-greg
 
Gregory Brown

I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?

Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.

Also, since you know the original encoding, you can use IConv to
explicitly convert to whatever you want.

-greg
 
James Gray

Searching the net I found some hacks that converted UTF-8 into single
byte characters: str_utf8.unpack('U*').pack('C*')

What you are doing there is transcoding from UTF-8 to Latin-1 (or
ISO-8859-1). Here's the proof:

$ ruby -KU -r iconv -e 'utf8 = "æøåÆØÅ"; p utf8.unpack("U*").pack("C*") == Iconv.conv("ISO-8859-1", "UTF-8", utf8)'
true
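
For readers without the iconv library, the same comparison can be sketched with String#encode. This assumes Ruby 1.9 or later, since String#encode and #force_encoding do not exist in 1.8, so it is only a cross-check and not something the SketchUp-bundled interpreter can run.

```ruby
# Ruby 1.9+ only: String#encode and #force_encoding are not available in 1.8.
# UTF-8 bytes for ae, o-slash, a-ring, tagged as UTF-8.
utf8 = [0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5].pack('C*').force_encoding('UTF-8')

via_pack   = utf8.unpack('U*').pack('C*')   # the unpack/pack hack
via_encode = utf8.encode('ISO-8859-1')      # built-in transcoding

# Compare raw bytes, ignoring the encoding tags on the two strings.
same_bytes = via_pack.unpack('C*') == via_encode.unpack('C*')
puts same_bytes
```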

James Edward Gray II
 
Thomas Chust

2009/7/7 Thomas Thomassen said:
[...]
So it seems that the File class doesn't just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?
[...]

Hello,

how Windows interprets file paths depends on which API calls you use
and on the current system locale. There is one set of Windows API
functions that always use UTF-16 and another one that always uses the
encoding associated with the current system locale.

I think Ruby indirectly accesses the latter API and doesn't do any
character set conversions before passing strings to the operating
system, but I'm not entirely sure there.

cu,
Thomas
 
Thomas Thomassen

Gregory said:
I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?
Because Sketchup uses Ruby to allow users to write plugins for the
applications. That flag, as far as I understand, is global and would
affect all scripts which might break a number of things. Also, the Ruby
1.8 version shipped with SU is not the whole package. Not sure if it's
possible even if I wanted it.

Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.

-greg
I'm currently looking into if the UTF-8 decimal codepoints (in range
128-255) are similar to the ISO-8859-1 or ANSI. That might be the
answer.
 
Thomas Thomassen

I've been doing some testing on the 128-255 range, and from what I can
gather all code points within the ISO 8859-1 range are identical to the
corresponding Unicode code points (the values unpack('U*') returns for
UTF-8 input).
 
Thomas Thomassen

I checked the $KCODE variable and it returns "UTF8".

Now, what does that do to Ruby? Why does File.exist?('c:\Test æøå') fail
if it's UTF-8 encoded?
 
Thomas Thomassen

James said:
I answer that question in detail in this article:

http://blog.grayproductions.net/articles/the_kcode_variable_and_jcode_library


The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.

James Edward Gray II

What does the IO method require? Is it the Ruby IO methods or the system
methods they call that don't handle UTF-8?
Windows' NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?
I'm trying to avoid transcoding to an 8-bit only encoding as that'll just
cause grief when I encounter characters outside the range.
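
To illustrate the kind of grief meant here: pack('C*') keeps only the low byte of each code point and raises no error, so a character above 255 silently turns into a different byte. The sketch below uses only unpack/pack, so it should behave the same on 1.8; latin1_or_nil is a hypothetical helper name, not part of any library.

```ruby
# Returns the single-byte form of a UTF-8 string, or nil when any character
# has a code point above 255 and so cannot fit into one byte.
# (latin1_or_nil is a made-up name for this sketch.)
def latin1_or_nil(utf8)
  codepoints = utf8.unpack('U*')
  return nil if codepoints.any? { |cp| cp > 255 }
  codepoints.pack('C*')
end

euro_utf8 = [0xE2, 0x82, 0xAC].pack('C*')   # UTF-8 bytes for the euro sign, U+20AC
ae_utf8   = [0xC3, 0xA6].pack('C*')         # UTF-8 bytes for ae, U+00E6

# Unguarded hack: U+20AC is cut down to its low byte 0xAC without any error.
mangled = euro_utf8.unpack('U*').pack('C*').unpack('C*')

guarded_euro = latin1_or_nil(euro_utf8)     # nil: not representable
guarded_ae   = latin1_or_nil(ae_utf8)       # one byte, 0xE6

puts mangled.inspect
puts guarded_euro.inspect
puts guarded_ae.unpack('C*').inspect
```

Checking before converting at least lets a plugin detect the paths it cannot handle instead of silently looking for a file under the wrong name.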
 
James Gray


That's a good question. I'm not sure what it does on Windows.
Is it the Ruby IO methods or the system methods they call that
don't handle UTF-8?

I assume it's the underlying Windows API, though I'm just guessing
there.
Windows' NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?

I think it depends on which API methods you call, so I'm guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.
I'm trying to avoid transcoding to an 8-bit only encoding as that'll
just cause grief when I encounter characters outside the range.

Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has
been improved there, using the new encoding support. I don't know
that it has. I'm more just wondering out-loud…

James Edward Gray II
 
Thomas Thomassen

James said:
That's a good question. I'm not sure what it does on Windows.
Any clues what it does on OSX? The scripts will run on macs as well.
I assume it's the underlying Windows API, though I'm just guessing
there.


I think it depends on which API methods you call, so I'm guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.
Since NTFS supports UTF, then I guess it's the Ruby API that calls the
wrong WinAPIs?
Can I make my own API calls?

Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has
been improved there, using the new encoding support. I don't know
that it has. I'm more just wondering out-loud…

James Edward Gray II

The scripts I write are plugins for Google Sketchup - so the Ruby version
I have at my disposal is the one Sketchup bundles - a partial 1.8 version.
While I've been searching for solutions I've noticed that v1.9 has
better support for various encodings, but unfortunately it's of no use
to me.

So my problem is that I have to deal with string data that comes from
Sketchup in UTF-8 format - might even have to deal with files and folder
that include characters outside the Windows1252 or ISO8859 range
(whatever the IO functions are using - I've not been able to pin-point
this.). If I get characters outside that range it's impossible to
transcode.
And I also don't know what would happen for an eastern user. I'm
wondering if the IO functions would assume a different 8bit encoding...
 
Bill Kelly

James Gray said:
I think it depends on which API methods you call, so I'm guessing you cannot do this. I think Ruby would need to be changed to
use those methods first.


Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has been improved there, using the new encoding support. I
don't know that it has. I'm more just wondering out-loud…

It's only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)

This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010


Regards,

Bill
 
Thomas Thomassen

Bill said:
James Gray said:
Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has been improved there, using the new encoding support. I
don't know that it has. I'm more just wondering out-loud…

It's only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)

This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010


Regards,

Bill
I see. So Ruby calls Win32 APIs that don't handle UTF-8. But what do
they use then? Windows-1252? (Or would that be system dependent?) If
it's not a fixed character set, is there any way of finding that out -
so that I have a chance to try to transcode correctly?
 
Thomas Thomassen

I just tried on a Mac - It worked fine with Norwegian letters there. So
it seems that Ruby 1.8 on OSX calls UTF-8 aware IO system calls.

Then it's the question of what encoding is used on Windows.

And can I call UTF-8 aware Windows IO API methods myself - bypassing the
built-in Ruby methods?
 
Bill Kelly

Thomas Thomassen said:
Any clues what it does on OSX? The scripts will run on macs as well.

Unlike that other OS, both OS X and Linux have taken an approach
I like to refer to as, NOT MIND-NUMBINGLY STUPID.

In OS X and Linux, one can use the same API calls one has always
used, as they are now UTF-8 savvy.

Since NTFS supports UTF, then I guess it's the Ruby API that calls the
wrong WinAPIs?
Can I make my own API calls?

In ruby 1.8 embedded into our C++ application, I've created hooks
so that I can call our unicode-savvy C++ routines from ruby.

I suppose it may be possible to do this without involving a
ruby C extension, assuming the ruby Win32API module can
be made to call routines like _wopen and such. I haven't tried that.
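
One piece of that route can be sketched without touching Windows at all: the wide routines expect NUL-terminated UTF-16LE, and that conversion is possible with unpack/pack alone. The helper name utf8_to_utf16le is made up for this sketch, and actually handing the result to a wide function through Win32API remains untried here as well.

```ruby
# Convert UTF-8 bytes to NUL-terminated UTF-16LE, the form the wide Windows
# routines expect. (utf8_to_utf16le is a hypothetical helper name.)
def utf8_to_utf16le(utf8)
  units = []
  utf8.unpack('U*').each do |cp|
    if cp > 0xFFFF
      # Outside the Basic Multilingual Plane: split into a surrogate pair.
      cp -= 0x10000
      units << (0xD800 | (cp >> 10))
      units << (0xDC00 | (cp & 0x3FF))
    else
      units << cp
    end
  end
  units << 0            # terminating NUL code unit
  units.pack('v*')      # 'v' = 16-bit unsigned, little-endian
end

path = [0x63, 0x3A, 0x5C, 0xC3, 0xA6].pack('C*')   # UTF-8 bytes for c:\ plus ae
wide = utf8_to_utf16le(path)

puts wide.unpack('v*').inspect
```

Unlike an 8-bit target, this conversion loses nothing: every code point that UTF-8 can carry has a UTF-16 form.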

The scripts I write are plugins for Google Sketchup - so the Ruby version
I have at my disposal is the one Sketchup bundles - a partial 1.8 version.
While I've been searching for solutions I've noticed that v1.9 has
better support for various encodings, but unfortunately it's of no use
to me.

So my problem is that I have to deal with string data that comes from
Sketchup in UTF-8 format - might even have to deal with files and folder
that include characters outside the Windows1252 or ISO8859 range
(whatever the IO functions are using - I've not been able to pin-point
this.). If I get characters outside that range it's impossible to
transcode.
And I also don't know what would happen for an eastern user. I'm
wondering if the IO functions would assume a different 8bit encoding...

For best 8-bit compatibility you'll want to encode to Windows1252.

But, this (of course) won't help at all with chinese characters, etc.
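
A short sketch of both points, assuming Ruby 1.9 or later for String#encode (so again not runnable in the 1.8 bundled with SketchUp): Windows-1252 has a slot for the euro sign where ISO-8859-1 has none, and neither can hold a Chinese character. The helper encodable? is a made-up name for illustration.

```ruby
# Ruby 1.9+ only: relies on String#encode and #force_encoding.
euro    = [0xE2, 0x82, 0xAC].pack('C*').force_encoding('UTF-8')   # U+20AC
chinese = [0xE4, 0xB8, 0xAD].pack('C*').force_encoding('UTF-8')   # U+4E2D

# Windows-1252 stores the euro sign as the single byte 0x80.
euro_1252 = euro.encode('Windows-1252').unpack('C*')

# True when every character of str exists in the target encoding.
# (encodable? is a hypothetical helper name.)
def encodable?(str, encoding)
  str.encode(encoding)
  true
rescue Encoding::UndefinedConversionError
  false
end

euro_in_latin1  = encodable?(euro, 'ISO-8859-1')
chinese_in_1252 = encodable?(chinese, 'Windows-1252')

puts euro_1252.inspect
puts euro_in_latin1
puts chinese_in_1252
```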


Regards,

Bill
 
