The return of the son of Umlaute

Rainer

Hello everybody,

I'd like to get a hint about a little Ruby problem I'm dealing with.
In my spare time I wrote a little Ruby program with a plain text
interface for I/O. I use Windows XP (yeah, I know :) with Ruby 1.8.6.

Now for the problem: In Germany, we have letters called "Umlaute",
which are rarely used in the English language. An example you might be
familiar with is the name of the rock band "Motörhead". (No, I don't
know their music, it's just an example.)

Now, if I wanted to save a file with the single line "Motörhead", I
would want to do something like this:

irb(main):001:0> a_string = "Motörhead"
=> "Mot\224rhead"
irb(main):002:0> a_file = File.new("test.txt", "w")
=> #<File:test.txt>
irb(main):003:0> a_file << a_string
=> #<File:test.txt>
irb(main):004:0> a_file.close
=> nil
irb(main):005:0>

However, when I open "test.txt" with a text editor, I get this result:

Mot"rhead

When I read a file with the correct spelling...

irb(main):006:0> puts File.open("test.txt", "r").readlines

...I get this result:

Mot÷rhead
=> nil

I've tried to use Iconv, but I don't know the names of the character
sets I have to use. To solve the problem, I've written this little
script...

def umlaute(line, replace_dict = File_To_Shell)
  changed_line = line.clone
  # gsub! replaces every occurrence; sub! would only replace the first one
  replace_dict.each_pair { |from, to| changed_line.gsub!(from, to) }
  changed_line
end

...which uses two hashes in order to replace the "offending"
characters back and forth:

File_To_Shell = {
  /\344/ => "\204", # "ä"
  /\366/ => "\224", # "ö"
  /\374/ => "\201", # "ü"
  /\304/ => "\216", # "Ä"
  /\326/ => "\231", # "Ö"
  /\334/ => "\232", # "Ü"
  /\337/ => "\341"  # "ß"
}

Shell_To_File = {
  /\204/ => "\344", # "ä"
  /\224/ => "\366", # "ö"
  /\201/ => "\374", # "ü"
  /\216/ => "\304", # "Ä"
  /\231/ => "\326", # "Ö"
  /\232/ => "\334", # "Ü"
  /\341/ => "\337"  # "ß"
}

This seems to work, but I consider this a rather ugly workaround and
hardly a decent solution, since it forces me to type...

a_string = umlaute("Motörhead")

...or even...

puts umlaute("Motörhead")

...whenever I have a "strange" character or two in my strings. Any help
is appreciated.
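
The two lookup tables above can be cross-checked against Ruby's built-in transcoder. This is only a sketch under two assumptions: Ruby 1.9 or later (the post itself uses Ruby 1.8.6, which has no String#encode), and that the file side is ISO-8859-1 while the shell side is code page 850, as worked out further down in this thread. `FILE_TO_SHELL_BYTES` and `table_matches_transcoder?` are illustrative names, not part of the original script.

```ruby
# Byte pairs copied from the File_To_Shell table (octal, as in the post).
FILE_TO_SHELL_BYTES = {
  0344 => 0204, # ä
  0366 => 0224, # ö
  0374 => 0201, # ü
  0304 => 0216, # Ä
  0326 => 0231, # Ö
  0334 => 0232, # Ü
  0337 => 0341  # ß
}

# Assumption: file bytes are ISO-8859-1, shell bytes are CP850.
# Returns true if the transcoder produces the same shell byte for every pair.
def table_matches_transcoder?(table)
  table.all? do |file_byte, shell_byte|
    char = [file_byte].pack("C").force_encoding("ISO-8859-1")
    char.encode("CP850").bytes == [shell_byte]
  end
end

puts table_matches_transcoder?(FILE_TO_SHELL_BYTES)
```

If the check passes, the hand-written tables are a partial, seven-character version of a full ISO-8859-1 to CP850 conversion.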

Thanks in advance,

Rainer Wolf
 
mortee

Rainer said:
irb(main):001:0> a_string = "Motörhead"
=> "Mot\224rhead"
irb(main):002:0> a_file = File.new("test.txt", "w")
=> #<File:test.txt>
irb(main):003:0> a_file << a_string
=> #<File:test.txt>
irb(main):004:0> a_file.close
=> nil
irb(main):005:0>

However, when I open "test.txt" with a text editor, I get this result:

Mot"rhead

When I read a file with the correct spelling...

irb(main):006:0> puts File.open("test.txt", "r").readlines

...I get this result:

Mot÷rhead
=> nil

This has just been discussed today. You have to use the same character
encoding for entering your Ruby code and for displaying the program output.
The first is determined (in your case above) by your terminal (because
you entered the code in irb), and the latter by the text editor in question.

If the two encodings don't match, then the byte string will be literally
misinterpreted.
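
A minimal sketch of that misinterpretation, assuming Ruby 1.9 or later (in the Ruby 1.8.6 used in this thread, strings carry no encoding at all). It also assumes that the shell uses code page 850 and that the editor reads Windows-1252; `read_as` is a hypothetical helper introduced only for this illustration.

```ruby
# Reinterpret raw bytes under a given encoding, then convert to UTF-8
# so the result can be compared or printed anywhere.
def read_as(raw_bytes, encoding)
  raw_bytes.dup.force_encoding(encoding).encode("UTF-8")
end

typed_in_shell  = "Mot\x94rhead".b  # what irb stored: \224 is byte 0x94
saved_by_editor = "Mot\xF6rhead".b  # "ö" as an ISO-8859-1 / Windows-1252 byte

puts read_as(typed_in_shell, "CP850")         # the shell's own view
puts read_as(typed_in_shell, "Windows-1252")  # the editor's view of the same bytes
puts read_as(saved_by_editor, "CP850")        # the shell's view of the editor's file
```

The bytes never change; only the table used to turn them into characters does. Under these assumptions the second call yields a closing quotation mark where the "ö" should be, and the third yields a division sign.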

mortee
 
7stud --

Welcome to Unicode hell. By the way, there is a third ring: in order to
post about a Unicode character on the internet, sometimes you need to
post the HTML entity corresponding to the Unicode character; otherwise
browsers display some symbol indicating that they can't render the
character. For instance, all I see are a bunch of black diamonds with
question marks littering your post.

I wish I could help you, but I'm stumped as to why irb is showing you
\224 for "LATIN SMALL LETTER O WITH DIAERESIS". This is what I get:

irb(main):001:0> str="Motörhead"
=> "Mot\303\266rhead"
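
The difference most likely comes from the terminal: \303\266 is how UTF-8 encodes the letter, while \224 is the single byte that code page 850 uses for it. A small sketch, assuming Ruby 1.9 or later for String#encode; `octal_bytes` is a made-up helper name.

```ruby
# Show the bytes of a string in a given encoding, in the octal
# notation that Ruby 1.8's irb uses for non-ASCII bytes.
def octal_bytes(text, encoding)
  text.encode(encoding).bytes.map { |byte| format("\\%03o", byte) }.join
end

o_umlaut = "\u00F6" # LATIN SMALL LETTER O WITH DIAERESIS

puts octal_bytes(o_umlaut, "UTF-8")      # \303\266
puts octal_bytes(o_umlaut, "CP850")      # \224
puts octal_bytes(o_umlaut, "ISO-8859-1") # \366
```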


One of these should display correctly:

Motörhead --> o with umlaut entered from my special characters palette in
my text editor

Motörhead --> html entity for o with umlaut
 
Rainer

Thank you for your comments, mortee and 7stud.

I've tried to find out what the character sets are. My Windows shell
says it works on codepage 850. (I used the command "chcp" to find
out.) Everything else seems to be ISO-8859-1. Does anyone know what I
have to substitute for one of the two ISO strings below...

Iconv.new('ISO-8859-1', 'ISO-8859-1').iconv(line)

...when I want to convert "line" between ISO-8859-1 and codepage 850?

Any help is appreciated.

Rainer
 
ed.odanow

Please also keep in mind that Windows uses two different encodings
internally: one for GUI programs and one for the console. You will see
this when typing "äöü" using a Windows editor and listing the file using
"type" in a Windows console, which will then produce...

C:\Dokumente und Einstellungen\wolfgang\Desktop>type umlaute.txt
õ÷³

...as output. This has the additional effect that using Umlaute in
strings in "irb" and writing these texts to a file...

C:\Dokumente und Einstellungen\wolfgang\Desktop>irb
irb(main):001:0> File.open('umlautausgabe.txt', 'w') do |f|
irb(main):002:1* f.print 'äöü'
irb(main):003:1> end
=> nil
irb(main):004:0> exit

...will produce a file with some unreadable data, when opened with the
Windows editor...

„â€Â

In addition, you will not be able to output UTF-8 encoded data correctly
on a Windows console (except for those characters that are ASCII).

Using UTF-8 encoded constants in Ruby programs is easy: you need to
edit the data in UTF-8 format with an editor. Unfortunately, you need a
"magic line" on Windows for Ruby 1.8. UTF-8 encoded data usually has a
BOM at the beginning of the file, which is not ignored by Ruby 1.8.
You must start the program with a line like...

=nil

...to avoid this, and then start your Ruby program using...

ruby -Ku programfile.rb
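
For data files (as opposed to program files), later Ruby versions can drop the byte order mark while reading. The sketch below is an alternative under stated assumptions, not something that works on the Ruby 1.8 discussed here: it assumes Ruby 1.9 or later and relies on the `BOM|UTF-8` open mode. `read_utf8_without_bom` and the temporary file are illustrative only.

```ruby
require "tempfile"

# The UTF-8 byte order mark is the three bytes EF BB BF.
UTF8_BOM = "\xEF\xBB\xBF".b

# "r:BOM|UTF-8" makes Ruby skip a leading BOM, if there is one.
def read_utf8_without_bom(path)
  File.open(path, "r:BOM|UTF-8") { |file| file.read }
end

# Simulate a file saved by an editor that writes a BOM.
demo = Tempfile.new("bom_demo")
demo.binmode
demo.write(UTF8_BOM + "\u00E4\u00F6\u00FC\n".b)
demo.close

raw  = File.binread(demo.path)
text = read_utf8_without_bom(demo.path)

puts raw.start_with?(UTF8_BOM)  # the BOM is on disk
puts text.bytesize              # 7: six bytes for the umlauts plus the newline
demo.unlink
```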

In addition, you should know that Ruby 1.8 doesn't support UTF-8
encoding in class String (extensions exist for that purpose), so
String handling is still based on bytes.

This changed completely in Ruby 1.9, where UTF-8 support is
built in.
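
To illustrate the difference between bytes and characters, here is a short sketch assuming Ruby 1.9 or later, where a String knows its encoding. On Ruby 1.8, `length` would report the byte count instead.

```ruby
umlauts = "\u00E4\u00F6\u00FC" # "äöü" as a UTF-8 string

puts umlauts.length    # 3 characters
puts umlauts.bytesize  # 6 bytes, two per umlaut in UTF-8

# Character-based operations work on whole characters, not on single bytes.
puts umlauts.reverse == "\u00FC\u00F6\u00E4"
```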

Wolfgang Nádasi-Donner
 
ed.odanow

Take a look at http://www.gnu.org/software/libiconv/

I found there everything that is necessary to do the job :)

C:\Dokumente und Einstellungen\wolfgang>irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> Iconv.iconv('utf-8', 'CP850', "äöü")
=> ["\303\244\303\266\303\274"]
irb(main):003:0> Iconv.iconv('ISO-8859-1', 'CP850', "äöü")
=> ["\344\366\374"]
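
The iconv library was removed from Ruby's standard library in Ruby 2.0. On Ruby 1.9 and later, the same two conversions can be sketched with String#encode. The input bytes below are an assumption: they are "äöü" as a CP850 console would deliver them.

```ruby
# "äöü" as typed in a CP850 console: the bytes \204 \224 \201.
typed = "\x84\x94\x81".b

# encode(to, from) mirrors the argument order of Iconv.iconv(to, from, str).
as_utf8   = typed.encode("UTF-8", "CP850")
as_latin1 = typed.encode("ISO-8859-1", "CP850")

p as_utf8.bytes.map { |b| b.to_s(8) }   # ["303", "244", "303", "266", "303", "274"]
p as_latin1.bytes.map { |b| b.to_s(8) } # ["344", "366", "374"]
```

The octal values match the irb output shown above.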

Wolfgang Nádasi-Donner
 
Rainer

ed.odanow said:
Take a look at http://www.gnu.org/software/libiconv/

I found there everything that is necessary to do the job :)

C:\Dokumente und Einstellungen\wolfgang>irb
irb(main):001:0> require 'iconv'
=> true
irb(main):002:0> Iconv.iconv('utf-8', 'CP850', "äöü")
=> ["\303\244\303\266\303\274"]
irb(main):003:0> Iconv.iconv('ISO-8859-1', 'CP850', "äöü")
=> ["\344\366\374"]

Wolfgang Nádasi-Donner

Hello Wolfgang,

thank you very much, the second line above did the trick, and it works
both ways! The irony is that I had actually seen the libiconv page you
referred me to. What I had failed to see was that the long
lists of character encodings and their explanations were meant to be
THE ACTUAL STRINGS you have to substitute for "to" and "from" in the iconv
method. Argh!

Thanks also to mortee and 7stud and to everyone who helped me on this.
This newsgroup has been a wonderful experience so far.

As a little "Thank you" to this group, I'm giving a short summary for
anyone who gets stuck in the same place as me, written with a
"cookbook approach" (sort of):

PROBLEM: When you read text files and use their contents in irb, some
of the characters look strange.

REASON: The character encoding you are using for your text files is
different from the one in your irb shell.

SOLUTION:

1. Find out which character encoding is used in your irb shell.

The solution for step 1 depends on your operating system. If you're
working with Windows, open "cmd.exe" and type "chcp" at the prompt. If
you're in Germany like me, you'll probably read this:
Aktive Codepage: 850.

2. Find out about the character encoding for your files.

I didn't use a command line tool for this. In my case it's ISO-8859-1
(Western European countries). You will find a good introduction here:
http://en.wikipedia.org/wiki/ISO_8859 (and here if you're German:
http://de.wikipedia.org/wiki/ISO_8859).

3. Find the function to convert between the encodings.

The function is

Iconv.iconv(to, from, *strs)

from the iconv standard library. An explanation for the character
encodings used by iconv is here: http://www.gnu.org/software/libiconv/

4. Find out what to substitute for "to" and "from".

This was the hard part for me. I failed to see that the encoding names
listed on the libiconv page are the actual strings you have to pass in.
Small excerpt:

----from libiconv page----
It provides support for the encodings:

European languages
ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U,
KOI8-RU, CP{1250,1251,1252,1253,1254,1257}, CP{850,866},
Mac{Roman,CentralEurope,Iceland,Croatian,Romania},
Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh
....
----from libiconv page----

This means (of course): 'ISO-8859-1' or 'ISO-8859-2' or 'CP850' or
'CP866' and so on.

5. The actual code for converting the strings back and forth on my
Windows XP machine:

require 'iconv'
my_string = 'Motörhead'  # typed in irb, so its bytes are CP850

# Converting from the shell to a file.
# Iconv.iconv returns an array with one entry per input string.
shell_to_file = Iconv.iconv('ISO-8859-1', 'CP850', my_string).first
File.open("umlaut.txt", "w") { |f| f << shell_to_file }

# Converting a string from the file in order to read it in the shell.
# The block form of File.open closes the file automatically.
string_from_file = File.open("umlaut.txt", "r") { |f| f.readlines.first }
file_to_shell = Iconv.iconv('CP850', 'ISO-8859-1', string_from_file).first

That's it!
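
For readers on Ruby 1.9 or later, where iconv is no longer needed, steps 1 and 5 of the recipe above can be sketched with the encoding support built into File.open. This is a hedged variant, not the code from this post: `encoding_from_chcp` is a hypothetical helper, the "chcp" output is passed in as a plain string instead of being read from the shell, and the file is written to a temporary directory.

```ruby
require "tmpdir"

# Step 1: turn the output of "chcp" into a Ruby encoding.
def encoding_from_chcp(chcp_output)
  number = chcp_output[/\d+/]
  raise ArgumentError, "no code page number found" unless number
  Encoding.find("CP#{number}")
end

shell_encoding = encoding_from_chcp("Aktive Codepage: 850.")
file_encoding  = Encoding::ISO_8859_1

# "Motörhead" as the CP850 shell delivers it (ö is byte 0x94).
my_string = "Mot\x94rhead".b.force_encoding(shell_encoding)

bytes_on_disk, string_from_file = Dir.mktmpdir do |dir|
  path = File.join(dir, "umlaut.txt")

  # Shell to file: "w:<encoding>" transcodes the string while writing.
  File.open(path, "w:#{file_encoding.name}") { |f| f << my_string }
  on_disk = File.binread(path).bytes

  # File to shell: "r:<external>:<internal>" transcodes while reading.
  mode = "r:#{file_encoding.name}:#{shell_encoding.name}"
  from_file = File.open(path, mode) { |f| f.readlines.first }

  [on_disk, from_file]
end

puts bytes_on_disk[3] == 0xF6       # ö is stored as the ISO-8859-1 byte
puts string_from_file == my_string  # the round trip gives the shell bytes back
```

Letting File.open do the conversion keeps the encoding names in one place, so the strings themselves never need an explicit conversion call.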

What I'm trying to do here is put Martin Fowler's tip, "if I want to
learn about something I write about it", to good use, so any comments
on and criticism of this solution are appreciated.

Cheers,

Rainer
 
