Ruby1.9 Encoding

Juliano 준호 · Sep 10, 2009

Hey, guys!

I've just started learning Ruby, coming from Python, and recently posted a
question here that was promptly and effectively answered (thanks,
Glenn!), so I decided to come here once more. I hope I'll soon be able
to answer some of the questions myself.

Here it is.

I'm writing a wrapper for a Korean Morphological Parser that only
works with EUC-KR encoding and has some trouble with longer texts. So,
first, I have to preprocess the input text to divide it into
sentences, remove the Unicode characters that are not related to Korean
and save them so they can be reinserted in the postprocessing stage.
This has worked out wonderfully and, I should say, more easily than what
I'd done in Python.

The problem I'm having now is converting the string from UTF-8 (I'm
running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
Morphological Parser, reading its output and processing it. I have to
parse a whole bunch of data "spidered" from the internet, and it works
great until the encoder comes across a Korean typo... To make this
point clearer: unlike Unicode, the EUC-KR encoding does not cover all
the possible combinations of initial consonant, vowel and final
consonant. Unicode encodes all 11,172 precomposed hangul syllables,
whereas EUC-KR (KS X 1001) covers only 2,350 of them. Most of the extra
chars are not used in everyday life, but they still turn up in
abbreviations, slang, smileys or typos...
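
To see the gap concretely, here is a minimal sketch that lists the characters Ruby 1.9+ cannot convert to EUC-KR. The helper names `encodable?` and `unmappable_chars` are made up for this illustration, and the emoji in the sample merely stands in for any character that EUC-KR lacks.

```ruby
# encoding: utf-8

# Hypothetical helper: true if `char` can be converted to `target`.
def encodable?(char, target)
  char.encode(target)
  true
rescue Encoding::UndefinedConversionError
  false
end

# Hypothetical helper: the distinct characters of `text` that have no
# mapping in `target`.
def unmappable_chars(text, target = "EUC-KR")
  text.each_char.reject { |char| encodable?(char, target) }.uniq
end

# The Unicode Hangul Syllables block spans U+AC00..U+D7A3.
hangul_block = (0xAC00..0xD7A3)
puts hangul_block.size                 # 11172

p unmappable_chars("한글 😀")           # ["😀"]
```

Running the same helper over real input should show which typos or slang syllables would make a strict `encode("EUC-KR")` fail.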

In Python, I would simply throw these chars away but I really didn't
manage to understand the "Ruby encoding way"... I know I'm missing
something, but I can't seem to find enough info around... Google
doesn't seem to know much of this either... So, I'm coming here to ask
for your enlightenment, dear rubyist friends!

The part of my code which deals with this is as follows:

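require 'tempfile'

def run(txt)
  # Raises Encoding::UndefinedConversionError when txt contains a
  # character that has no EUC-KR equivalent.
  txt = txt.encode("EUC-KR")
  kts_file = Tempfile::new('kts_text')
  kts_file = open(kts_file.path, "w:EUC-KR")
  kts_file << "#{txt}\n"
  kts_file.close
  cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
  # Run the parser and convert its EUC-KR output back to UTF-8.
  IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
end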

I found something about "ignoring" the non-existent codepoints, but it
doesn't work... I'm even starting to think that my Ruby installation
might have gotten corrupted somehow... Every time I think I've done it
right, I still get The Exception popping up on the screen...

Thanks for your patience reading this looong post.

Juliano

Axel Etzold · Sep 10, 2009

Dear Juliano,

a disclaimer first: I know no Korean, so what's below might not work.

I recently had to write some code to resolve Arabic ligatures
(combinations of two letters). Much as you describe, most of the time
there is no need for the special combined form, and unfortunately the
same word is sometimes spelled one way and sometimes the other, which
produces a list of duplicate words.

I solved that problem with a list of Unicode characters that gives the
name of each individual character.

You might find the table on this page useful:

http://www.kfunigraz.ac.at/~katzer/korean_hangul_unicode.html

I don't know whether that list is exhaustive, but you could try to
convert each of the syllables listed there individually from Unicode to
EUC-KR and, where that fails, decide what to do with that particular
combination of signs based on its Latin transcription, building a
transform hash for these encodings yourself.
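
That scan could be sketched roughly as follows: walk the Unicode Hangul Syllables block (U+AC00 to U+D7A3), try to convert each syllable to EUC-KR, and collect the ones that fail. The method names, the `TRANSFORM` table and its single entry (including the transcription) are invented for illustration; real substitutions would need someone who knows Korean. The `fallback:` option of `String#encode` is used here to apply the table.

```ruby
# encoding: utf-8

# Collect every precomposed hangul syllable (U+AC00..U+D7A3) that Ruby
# cannot convert to `target`.
def syllables_missing_from(target = "EUC-KR")
  (0xAC00..0xD7A3).map { |cp| cp.chr(Encoding::UTF_8) }.reject do |syllable|
    begin
      syllable.encode(target)
      true
    rescue Encoding::UndefinedConversionError
      false
    end
  end
end

# Placeholder table: the entry and its transcription are made up.
TRANSFORM = { "뷁" => "bwelk" }

# Convert to EUC-KR, substituting from `table`; any other character
# without an EUC-KR mapping becomes "?".
def to_euc_kr(text, table = TRANSFORM)
  text.encode("EUC-KR", fallback: lambda { |char| table.fetch(char, "?") })
end

missing = syllables_missing_from
puts "#{missing.size} syllables have no EUC-KR mapping"
puts to_euc_kr("한글 😀").encode("UTF-8")
```

The list returned by `syllables_missing_from` is what the transform hash would have to cover; anything left out falls through to the "?" default.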

There might also be some locale- or OS-related problems with Iconv::IGNORE.
There's some discussion of this here:
http://aspn.activestate.com/ASPN/Mail/Message/ruby-talk/3189105

Best regards,

Axel

James Edward Gray II · Sep 10, 2009

> Hey, guys!
>
> I've just started learning Ruby, coming from Python, and recently posted a
> question here that was promptly and effectively answered (thanks,
> Glenn!), so I decided to come here once more. I hope I'll soon be able
> to answer some of the questions myself.

Welcome to Ruby.

> The problem I'm having now is converting the string from UTF-8 (I'm
> running Ubuntu with the pt_BR.UTF-8 locale) into EUC-KR, running the
> Morphological Parser, reading its output and processing it. I have to
> parse a whole bunch of data "spidered" from the internet, and it works
> great until the encoder comes across a Korean typo... To make this
> point clearer: unlike Unicode, the EUC-KR encoding does not cover all
> the possible combinations of initial consonant, vowel and final
> consonant. Unicode encodes all 11,172 precomposed hangul syllables,
> whereas EUC-KR (KS X 1001) covers only 2,350 of them. Most of the extra
> chars are not used in everyday life, but they still turn up in
> abbreviations, slang, smileys or typos...
>
> In Python, I would simply throw these chars away but I really didn't
> manage to understand the "Ruby encoding way"...

I think we can throw them away in Ruby too. See below.

> I know I'm missing something, but I can't seem to find enough info
> around... Google doesn't seem to know much of this either...

I wrote a lot about Ruby's encoding engine on my blog:

http://blog.grayproductions.net/articles/understanding_m17n

> The part of my code which deals with this is as follows:
>
> def run(txt)
>   txt = txt.encode("EUC-KR")

Try replacing the above line with:

  txt = txt.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

>   kts_file = Tempfile::new('kts_text')
>   kts_file = open(kts_file.path, "w:EUC-KR")
>   kts_file << "#{txt}\n"
>   kts_file.close
>   cmd = "ktspell < #{kts_file.path}" # 2> /dev/null"
>   IO::popen(cmd, "r:EUC-KR").read.encode("UTF-8")
> end
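
A small self-contained check of that suggestion, as a sketch: the sample string is made up, the emoji stands in for any character EUC-KR cannot represent, and `to_euc_kr_lossy` is a hypothetical helper name.

```ruby
# encoding: utf-8

# Hypothetical helper wrapping the suggested options: characters with no
# EUC-KR mapping (and invalid byte sequences) are replaced by "".
def to_euc_kr_lossy(text)
  text.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")
end

sample = "안녕 😀 한글"

begin
  sample.encode("EUC-KR")
rescue Encoding::UndefinedConversionError => e
  # error_char holds the character that could not be converted.
  puts "strict encode failed on #{e.error_char.inspect}"
end

euc = to_euc_kr_lossy(sample)
puts euc.encoding            # EUC-KR
puts euc.encode("UTF-8")     # the sample without the emoji
```

Note that the dropped character leaves its surrounding spaces behind, so the output contains two consecutive spaces.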

Hope that helps.
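
Putting the pieces together, the whole round trip might look like the sketch below. It is not the original code: `run_filter` is a hypothetical name, `command` stands for any program that reads EUC-KR on stdin and writes EUC-KR on stdout ("ktspell" in this thread), and `cat` is used only so the example runs without the parser installed. It also keeps a single handle on the temporary file instead of reassigning the Tempfile variable, which is an assumption about what the original intended.

```ruby
# encoding: utf-8
require "tempfile"
require "shellwords"

# Sketch only: encode to EUC-KR (dropping unmappable characters), feed the
# text to `command` through a temporary file, and return its output as UTF-8.
def run_filter(text, command)
  euc = text.encode("EUC-KR", invalid: :replace, undef: :replace, replace: "")

  # Tempfile.create (Ruby 2.1+) removes the file when the block ends.
  Tempfile.create("kts_text") do |file|
    file.binmode                       # write the EUC-KR bytes untouched
    file << euc << "\n"
    file.flush

    shell_cmd = "#{command} < #{Shellwords.escape(file.path)}"
    output = IO.popen(shell_cmd, "r:EUC-KR") { |io| io.read }
    output.encode("UTF-8")
  end
end

puts run_filter("한글 😀", "cat")
```

With the real parser the call would presumably be `run_filter(text, "ktspell")`, assuming ktspell is on the PATH.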

James Edward Gray II
