Unicode in irb on Windows (and script/console in InstantRails)

michael.raidel

Hi everyone!

I have a problem with Unicode in irb on Windows. I noticed it when
trying to save an attribute of an ActiveRecord model with an umlaut
(for example "ü") in script/console. If the database connection is
encoded in utf8, everything after the umlaut gets truncated; in the
default encoding I get funny characters back. It doesn't matter whether
$KCODE is set to UTF8 or NONE: the character count stays the same
(also in plain irb)!

Does anyone have a hint on how to solve this? Of course I could try
things such as Cygwin, but I am trying to find an elegant solution for
Windows users, which could eventually be merged into the next
InstantRails release, if Curt agrees.

Thanks a lot,

Michael
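
One plausible reading of the truncation, offered as an assumption rather than a diagnosis: the console hands irb the umlaut as a single legacy-code-page byte (0x81 in code page 850, 0xFC in Windows-1252), and neither byte is a valid UTF-8 sequence on its own, so a UTF-8 connection may give up at that point. A minimal sketch for checking this, assuming Ruby 1.9 or later (String#force_encoding and String#valid_encoding? do not exist in Ruby 1.8); the helper name is made up for illustration:

```ruby
# Hypothetical helper: report whether a raw byte sequence is well-formed UTF-8.
def valid_utf8_bytes?(bytes)
  bytes.pack("C*").force_encoding("UTF-8").valid_encoding?
end

puts valid_utf8_bytes?([0xC3, 0xBC])  # "ü" encoded as UTF-8
puts valid_utf8_bytes?([0xFC])        # "ü" as a single Windows-1252 byte
puts valid_utf8_bytes?([0x81])        # "ü" as a single code page 850 byte
```

Only the first call reports a valid sequence; the two single-byte forms are rejected.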
 
Austin Ziegler

I have a problem with Unicode in irb on Windows. I noticed it when
trying to save an attribute of an ActiveRecord model with an umlaut
(for example "ü") in script/console. If the database connection is
encoded in utf8, everything after the umlaut gets truncated; in the
default encoding I get funny characters back. It doesn't matter whether
$KCODE is set to UTF8 or NONE: the character count stays the same
(also in plain irb)!

The Windows console -- also used by cygwin -- doesn't recognise UTF-8.
(That is, it's not possible to properly display UTF-8 in cmd.exe, at
least so far as I can tell.)

-austin
--
Austin Ziegler * (e-mail address removed) * http://www.halostatue.ca/
* (e-mail address removed) * http://www.halostatue.ca/feed/
* (e-mail address removed)
 
Chilkat Software

A DOS console displays characters according to the OEM code page. Here is
an example showing how to properly display a string with 8-bit chars
(e.g. characters with diacritics, or accent marks)...

# file: oemCodePage.rb

require 'chilkat'

# (The CkString class is freeware)
myStr = Chilkat::CkString.new()

# A DOS console does NOT display this correctly:
print "é ô à ç\n"

# What we need is the OEM (DOS) code page...
# OEM code pages are listed here:
# http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp
myStr.appendAnsi("é ô à ç\n")

# Emit the string in the character encoding of your choice:
# ibm850 is the OEM code page for Latin1
print myStr.getEnc("ibm850")

# Chilkat supports these:
# us-ascii
# unicode
# unicodefffe
# iso-8859-1
# iso-8859-2
# iso-8859-3
# iso-8859-4
# iso-8859-5
# iso-8859-6
# iso-8859-7
# iso-8859-8
# iso-8859-9
# iso-8859-13
# iso-8859-15
# windows-874
# windows-1250
# windows-1251
# windows-1252
# windows-1253
# windows-1254
# windows-1255
# windows-1256
# windows-1257
# windows-1258
# utf-7
# utf-8
# utf-32
# utf-32be
# shift_jis
# gb2312
# ks_c_5601-1987
# big5
# iso-2022-jp
# iso-2022-kr
# euc-jp
# euc-kr
# macintosh
# x-mac-japanese
# x-mac-chinesetrad
# x-mac-korean
# x-mac-arabic
# x-mac-hebrew
# x-mac-greek
# x-mac-cyrillic
# x-mac-chinesesimp
# x-mac-romanian
# x-mac-ukrainian
# x-mac-thai
# x-mac-ce
# x-mac-icelandic
# x-mac-turkish
# x-mac-croatian
# asmo-708
# dos-720
# dos-862
# ibm037
# ibm437
# ibm500
# ibm737
# ibm775
# ibm850
# ibm852
# ibm855
# ibm857
# ibm00858
# ibm860
# ibm861
# ibm863
# ibm864
# ibm865
# cp866
# ibm869
# ibm870
# cp875
# koi8-r
# koi8-u
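
The same conversion can be sketched without Chilkat, assuming Ruby 1.9 or later, where String#encode is built in (Ruby 1.8, current when this thread was written, would need Iconv instead). The helper name is hypothetical, and "CP850" is the name Ruby uses for the ibm850 code page mentioned above:

```ruby
# Hypothetical helper: re-encode text into an OEM code page before printing,
# so a console set to that code page shows the intended glyphs.
# Characters with no equivalent in the target code page become "?".
def to_oem(text, code_page = "CP850")
  text.encode(code_page, invalid: :replace, undef: :replace, replace: "?")
end

# "é ô à ç" written with escapes so the source file's own encoding does not matter.
accented = "\u00E9 \u00F4 \u00E0 \u00E7"
oem = to_oem(accented)
puts oem.bytes.map { |b| format("0x%02X", b) }.join(" ")
```

On a console whose active code page is 850, printing the converted string itself (print oem) should show the accented characters as intended.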
 
Austin Ziegler

The Windows console -- also used by cygwin -- doesn't recognise UTF-8.
(That is, it's not possible to properly display UTF-8 in cmd.exe, at
least so far as I can tell.)

Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:

chcp 65001

There are some caveats, of course:

http://blogs.msdn.com/michkap/archive/2006/03/06/544251.aspx

-austin
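
A script can check which code page is active before deciding how to encode its output. The sketch below parses the text that chcp prints. The helper name is made up, and the sample strings assume the English wording "Active code page: 850"; localized Windows versions word the message differently, which is why only the digits are read.

```ruby
# Hypothetical helper: pull the numeric code page out of `chcp` output.
# Returns nil when no number is present.
def parse_code_page(chcp_output)
  digits = chcp_output[/\d+/]
  digits && digits.to_i
end

# On Windows the input would come from running the command itself:
#   parse_code_page(`chcp`)
puts parse_code_page("Active code page: 850")
puts parse_code_page("Active code page: 65001") == 65001
```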
 
David Vallner

Austin said:

Ack my bad. I had forgotten: you can specify the UTF-8 codepage
(CP_UTF8) with:

chcp 65001

There are some caveats, of course:

http://blogs.msdn.com/michkap/archive/2006/03/06/544251.aspx

Also the good old combo of "mode con codepage select=65001".

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81rn.asp
lists pretty much all the numbers you can use. (The pain of navigating
to that on the MSDN website.)

Amusingly enough, none of those are even present anymore on WinXP Pro
x64. For yet more hilarity, the console is by default set to the DOS OEM
codepage of the given locale, instead of the newer ANSI ones that are
ISO extensions, which causes great fun when trying to use software
that's ever so smart and autodetects my locale as my preferred language
(Postgres, assorted GNU stuff being too clever by half) instead of using
the OS language version.

And "there are some caveats" is an understatement, the UTF-8 support in
the console is a sham - I couldn't get a trivial C program using
arbitrary combinations of tchar.h, wchar.h, -DUNICODE, cmd.exe, the
Windows console, a Cygwin and an MSYS rxvt to do something as daunting
as input random characters that aren't shared between Latin1 and Latin2
codepages, store them as multibyte internally, and then write them out
to a text file and to the console successfully without one step
breaking. The fact that the whole of CMD broke down in tears from changing
that setting is also worth noting - IIRC, it had problems doing output
redirection to a file and whatnot (I can't play around with this without
setting up a virtual machine with a 32bit XP). Basically, the Path Less
Annoying is to only use the console for working in your "native"
codepage, and use a non-console tool for everything else.

end # of rant

David Vallner
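
The OEM-versus-ANSI mismatch described above can be made concrete by decoding one and the same byte under both code pages. This is a small sketch assuming Ruby 1.9 or later; the helper name is hypothetical, and code page 850 and Windows-1252 stand in for whatever OEM/ANSI pair a given locale uses:

```ruby
# Hypothetical helper: interpret one raw byte under a given code page
# and return the resulting character as UTF-8.
def decode_byte(byte, code_page)
  [byte].pack("C").force_encoding(code_page).encode("UTF-8")
end

ansi = decode_byte(0xFC, "Windows-1252")  # how a Windows-1252 program reads byte 0xFC
oem  = decode_byte(0xFC, "CP850")         # how a code page 850 console reads it
puts ansi == "\u00FC"  # U+00FC, u with diaeresis
puts oem == "\u00B3"   # U+00B3, superscript three
```

A byte written as "ü" by an ANSI program therefore shows up as a superscript three in an OEM console, which may be where "funny characters" like the ones in the original question come from.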


 
michael.raidel

Ack my bad. I had forgotten: you can specify the UTF-8 codepage (CP_UTF8) with:
chcp 65001

Thank you Austin for the nice hint!

The problem is that as soon as I switch the codepage, irb (and also
script/console) stops working (it doesn't even start anymore; it just
quits immediately without an error message).

Michael
 
Austin Ziegler

Thank you Austin for the nice hint!

The problem is that as soon as I switch the codepage, irb (and also
script/console) stops working (it doesn't even start anymore; it just
quits immediately without an error message).

That's one of the caveats mentioned: batch files no longer work.
I don't know why. However, if you have Ruby installed in C:\Ruby, you can do:

copy C:\Ruby\bin\irb C:\Ruby\bin\irb.rb
irb.rb

Or:

ruby C:\Ruby\bin\irb

And you'll get a working irb.

-austin
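
The second workaround can also be expressed from inside Ruby, without hard-coding C:\Ruby. The sketch below assumes Ruby 1.9.2 or later (for RbConfig.ruby) and assumes the irb script sits in the same bin directory as the interpreter, as in the layout above; the helper name is made up:

```ruby
require "rbconfig"

# Hypothetical helper: build the command that runs the irb script through the
# Ruby interpreter directly, bypassing any irb.bat wrapper.
def direct_irb_command
  irb_script = File.join(RbConfig::CONFIG["bindir"], "irb")
  [RbConfig.ruby, irb_script]
end

puts direct_irb_command.join(" ")
# To launch it from a script: system(*direct_irb_command)
```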
 
