UTF-8 support - still stuck

Thomas Luedeke · Mar 5, 2011

OK, I appreciate the feedback on my last post regarding pattern matching
accented French characters. But I am still not getting anywhere.

I'm running Ruby 1.9.2p0.

Here's the type of pseudo-code I want to use.

====================================

variable = "exagérer"

if variable =~ /érer$/ then
  print "the verb was #{variable}"
end

====================================

I've tried using jcode (which is apparently gone), -u extensions, having
the string # coding: UTF-8 at the beginning of the script, etc.

What I really want to do is read in a comprehensive list of verbs (with
various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

-- 
Posted via http://www.ruby-forum.com/.

Quintus · Mar 5, 2011

On 05.03.2011 18:53, Thomas Luedeke wrote:

OK, I appreciate the feedback on my last post regarding pattern matching
accented French characters. But I am still not getting anywhere.

I'm running Ruby 1.9.2p0.

Here's the type of pseudo-code I want to use.

====================================

variable = "exagérer"

if variable =~ /érer$/ then
print "the verb was #{variable}"
end

====================================

I've tried using jcode (which is apparently gone), -u extensions, having
the string # coding: UTF-8 at the beginning of the script, etc.

What I really want to do is read in a comprehensive list of verbs (with
various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

Encode your string as UTF-8 and match it against a UTF-8 regexp.
The simplest way is something like this:

==============================
#Encoding: UTF-8

variable = "exagérer"

puts "The verb was #{variable}" if variable =~ /érer/
==============================

Ensure that your editor saves the file in UTF-8 (some don't do this by
default, notably Windows Notepad and SciTE).

If you have the verbs in an external file (which I assume), and that
file is encoded in UTF-8, you can do the following (assuming that there
is one verb per line):

=================================
#Encoding: UTF-8

verbs = File.readlines("verbs.txt")

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/
=================================

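If the file is in another encoding, e.g. Windows-1252, transcode it to
UTF-8 while reading so that the strings stay compatible with the UTF-8
regexp:

==================================
#Encoding: UTF-8

# "r:Windows-1252:UTF-8" reads Windows-1252 and hands back UTF-8 strings
verbs = File.open("verbs.txt", "r:Windows-1252:UTF-8"){|f| f.readlines}

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/
==================================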
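
To try the transcoding read without having a real verb list at hand, here is a minimal, self-contained sketch. The temporary file and the two verbs in it are assumptions made up for the demonstration; only the "r:Windows-1252:UTF-8" mode string is the point.

```ruby
# encoding: UTF-8
require "tempfile"

# Hypothetical verb list, written out as Windows-1252 bytes for the demo.
file = Tempfile.new("verbs")
file.close
File.open(file.path, "wb") do |f|
  f.write("exagérer\nparler\n".encode("Windows-1252"))
end

# External encoding Windows-1252, internal encoding UTF-8:
# the strings we get back are already UTF-8.
verbs = File.open(file.path, "r:Windows-1252:UTF-8") do |f|
  f.readlines.map { |line| line.chomp }
end

verbs.each do |verb|
  puts "The verb was #{verb}" if verb =~ /érer$/
end
```

Because the strings arrive as UTF-8, the UTF-8 regexp literal can be matched against them directly.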

The line saying "#Encoding: UTF-8" is a so-called magic comment that
tells Ruby to treat the content of this file as UTF-8-encoded text. If
you leave it out, Ruby 1.9 assumes your file is encoded in US-ASCII,
which will cause errors as soon as you use characters not defined in
ASCII. As an alternative, you may start Ruby with the -U (capital U)
switch, but I didn't try this.

Read up on String#encode and String#force_encoding if you want to
convert between encodings or change the encoding tag of a string without
actually touching the data in it.
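
The difference between the two methods can be seen in a small sketch (the variable names are made up for illustration): String#encode changes the bytes and keeps the characters, while String#force_encoding keeps the bytes and only changes the label.

```ruby
# encoding: UTF-8

utf8 = "exagérer"

# String#encode transcodes: same characters, different bytes.
latin1 = utf8.encode("ISO-8859-1")
puts utf8.bytesize    # "é" takes two bytes in UTF-8
puts latin1.bytesize  # "é" takes one byte in ISO-8859-1

# String#force_encoding relabels: same bytes, different interpretation.
relabeled = latin1.dup.force_encoding("UTF-8")
puts relabeled.valid_encoding?  # the lone 0xE9 byte is not valid UTF-8

# Transcoding back restores the original string.
puts latin1.encode("UTF-8") == utf8
```

So force_encoding is the tool when the bytes are right but the tag is wrong, and encode is the tool when the bytes themselves need to change.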

Since Ruby 1.9, Ruby has quite good support for encodings other than ASCII.

Just a thought: Is there anything such as Regexp#encode?

Vale,
Marvin

Quintus · Mar 5, 2011

On 05.03.2011 20:31, Quintus wrote:

=================================
#Encoding: UTF-8

What I forgot to mention: Some editors put an invisible BOM (Byte Order
Mark) at the beginning of UTF-8 files. That one can cause problems
because the first line is not read properly in that case. So ensure your
editor doesn't write the BOM.
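
If you are not sure whether an editor wrote a BOM, you can check the first three bytes of the file. This is only a sketch: the helper name utf8_bom? and the temporary files are made up for the demonstration. The last line uses the "r:BOM|UTF-8" open mode, which skips a leading BOM when reading a data file.

```ruby
# encoding: UTF-8
require "tempfile"

# Returns true if the file at `path` starts with the UTF-8 BOM bytes EF BB BF.
def utf8_bom?(path)
  bom = [0xEF, 0xBB, 0xBF].pack("C*")
  File.open(path, "rb") { |f| f.read(3) } == bom
end

# Demonstration files (hypothetical content): one with a BOM, one without.
with_bom = Tempfile.new("with_bom")
with_bom.binmode
with_bom.write([0xEF, 0xBB, 0xBF].pack("C*") + "verbs\n")
with_bom.close

without_bom = Tempfile.new("without_bom")
without_bom.write("verbs\n")
without_bom.close

puts utf8_bom?(with_bom.path)
puts utf8_bom?(without_bom.path)

# Reading with "BOM|UTF-8" drops the BOM from the returned text.
text = File.open(with_bom.path, "r:BOM|UTF-8") { |f| f.read }
puts text
```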

Vale,
Marvin

7stud -- · Mar 5, 2011

You can try and troubleshoot the problems you are having by determining
the encoding of every string in your program.

To determine your source code's encoding, i.e. what the literal strings
you type in your program get encoded as, do this:

puts __ENCODING__

To determine a particular string's encoding, e.g. a string you read from
a file, do this:

puts the_str.encoding.name
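
A slightly longer sketch of the same idea, with a hypothetical helper (describe) that also reports whether the bytes are valid for the encoding the string claims to have:

```ruby
# encoding: UTF-8

# Hypothetical helper: summarize what Ruby believes about a string.
def describe(str)
  "#{str.encoding.name}, valid=#{str.valid_encoding?}, " \
    "#{str.length} chars, #{str.bytesize} bytes"
end

puts __ENCODING__          # the source encoding of this file

verb = "exagérer"          # a literal takes the source encoding
puts describe(verb)

# The same bytes with the wrong label are reported as invalid.
mislabeled = verb.encode("ISO-8859-1").force_encoding("UTF-8")
puts describe(mislabeled)
```

A string that reports valid=false usually means the label and the actual bytes disagree, which is a good hint that a file was saved in a different encoding than expected.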

7stud -- · Mar 5, 2011

By the way, if you read the strings from a file, it might be easier to
change the encoding of the regex to match the encoding of the strings.
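
As far as I know there is no Regexp#encode, but one way to get the same effect is to build the regexp from a pattern string that has been transcoded to the target string's encoding. A minimal sketch, where regexp_for is a made-up helper name and the Windows-1252 string stands in for a line read from a file:

```ruby
# encoding: UTF-8

# Stand-in for a line read from a Windows-1252 file (assumed for illustration).
verb = "exagérer".encode("Windows-1252")

# Hypothetical helper: rebuild a pattern in the encoding of the target string.
def regexp_for(pattern_source, target)
  Regexp.new(pattern_source.encode(target.encoding))
end

ending = regexp_for("érer$", verb)

puts ending.encoding.name
puts "The verb matched" if verb =~ ending
```

This only works when every character in the pattern exists in the target encoding; otherwise String#encode raises an error.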

Thomas Luedeke · Mar 6, 2011

The script is:

========================================

#! /bin/ruby -vU

#Encoding: UTF-8



verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end


========================================

The error I get is:

ruby note.rb

note.rb:9: invalid multibyte char (UTF-8)

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"

Quintus · Mar 6, 2011

On 06.03.2011 08:47, Thomas Luedeke wrote:

The script is:

========================================

#! /bin/ruby -vU

#Encoding: UTF-8

verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end

========================================

Don't leave a blank line between the shebang line and the magic comment.
The magic comment must either be the very first line, or the second one
if you have a shebang.
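
For illustration, a sketch of the layout described above, with the magic comment directly under the shebang. The shebang path here is an assumption, and the sketch uses plain verb where the quoted script has ${verb}, which is shell syntax rather than Ruby. The file still has to be saved as UTF-8 for this to work.

```ruby
#!/usr/bin/env ruby
# encoding: UTF-8

verb = "appèler"

# Build the message only when the verb has the expected ending.
message = "The verb was #{verb}" if verb =~ /èler$/
puts message
```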

Vale,
Marvin

Alexey Petrushin · Mar 7, 2011

Add this line to your ~/.profile.

export RUBYOPT="-Ku -rrubygems"

Sadly, there's no other way to set a global default source encoding in
Ruby 1.9.

7stud -- · Mar 8, 2011

Thomas Luedeke wrote in post #985708:

This seems to have worked in Notepad++, set to UTF-8 and with the BOM
off:

======================

#! /bin/ruby -Kn

#Encoding: UTF-8

verb = "appèler"
if( verb =~ /èler/) then print "The verb was #{verb}" end

======================

I think it was the -Kn flag, although I don't understand what that
changes. I'll look into it. Thanks for all your help!

In Ruby 1.8, there is a variable called $KCODE. If you set it to "UTF-8"
(or just "U"), regular expressions match characters rather than single
bytes. If you set $KCODE to "N" (the default), regular expressions match
single bytes (unless you use the /u flag on your regular expression).

You can set $KCODE from the command line, e.g. -Ku or -Kn. As far as I
know, Ruby 1.9 ignores assignments to $KCODE itself; there the -K
switches set the default source encoding instead.


brabuhr · Mar 8, 2011

The script is:

========================================
#! /bin/ruby -vU
#Encoding: UTF-8
verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end
========================================

note.rb:9: invalid multibyte char (UTF-8)

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"

Are you absolutely certain that your file is UTF-8 encoded?

$ cat i.rb
#Encoding: UTF-8
verb = "appèler"
puts "The verb was #{verb}" if verb =~ /èler/

$ ruby -v i.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
The verb was appèler

$ enca -L none i.rb
Universal transformation format 8 bits; UTF-8

$ iconv -t LATIN1 -f UTF8 < i.rb > l.rb

$ enca -L none l.rb
Unrecognized encoding

$ ruby -v l.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
l.rb:2: invalid multibyte char (UTF-8)
l.rb:2: syntax error, unexpected tIDENTIFIER, expecting $end
verb = "app?ler"
^
