UTF-8 support - still stuck

Thomas Luedeke

OK, I appreciate the feedback on my last post regarding pattern matching
accented French characters. But I am still not getting anywhere.

I'm running Ruby 1.9.2p0.

Here's the type of pseudo-code I want to use.

====================================

variable = "exagérer"

if variable =~ /érer$/ then
print "the verb was #{variable}"
end

====================================

I've tried using jcode (which is apparently gone), -u extensions, having
the string # coding: UTF-8 at the beginning of the script, etc.

What I really want to do is read in a comprehensive list of verbs (with
various French accented characters), then have a simple I/O where I test
myself on a conjugation. So I need to be able to read, write, and
pattern match the accented characters.

What do I have to do to make this work?

TPL

--
Posted via http://www.ruby-forum.com/.
 
Quintus

On 05.03.2011 18:53, Thomas Luedeke wrote:
What do I have to do to make this work?

TPL

Encode your string as UTF-8 and match it against a UTF-8 regexp.
The simplest way to do this is something like this:

==============================
#Encoding: UTF-8

variable = "exagérer"

puts "The verb was #{variable}" if variable =~ /érer/
==============================

Ensure that your editor saves the file in UTF-8 (some don't do this by
default, notably Windows Notepad and SciTE).

If you have the verbs in an external file (which I assume you do), and
that file is encoded in UTF-8, you can do the following (assuming that
there is one verb per line):

=================================
#Encoding: UTF-8

verbs = File.readlines("verbs.txt")

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/
=================================

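If the file is in another encoding, e.g. Windows-1252, have Ruby
transcode it to UTF-8 while reading (otherwise matching the strings
against a UTF-8 regexp raises an Encoding::CompatibilityError):

==================================
#Encoding: UTF-8

# "r:Windows-1252:UTF-8" = read Windows-1252, hand back UTF-8 strings
verbs = File.open("verbs.txt", "r:Windows-1252:UTF-8"){|f| f.readlines}

puts "The verb was #{verbs.first}" if verbs.first =~ /érer/
==================================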
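
To make the file-reading variants above easy to try without an existing
verb list, here is a minimal, self-contained sketch. The file name, the
temporary directory, and the three verbs are made up for illustration;
the sketch writes a small Windows-1252 file first and then reads it back
as UTF-8 so the regexp literal and the strings share one encoding.

```ruby
# encoding: UTF-8
require "tmpdir"

# Hypothetical sample data, not from the original post.
path = File.join(Dir.mktmpdir, "verbs.txt")
text = "exagérer\nappeler\npréférer\n"

# Write the raw Windows-1252 bytes ("wb" = binary mode, no conversion).
File.open(path, "wb") { |f| f.write(text.encode("Windows-1252")) }

# External encoding Windows-1252, internal encoding UTF-8:
# every line comes back as a UTF-8 string.
verbs = File.open(path, "r:Windows-1252:UTF-8") { |f| f.readlines.map(&:chomp) }

# The regexp literal is UTF-8 because of the magic comment above.
matches = verbs.select { |v| v =~ /érer\z/ }
puts matches.inspect
```

The chomp call strips the trailing newline that readlines keeps, which
matters as soon as the pattern is anchored at the end of the string.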

The line saying "#Encoding: UTF-8" is a so-called magic comment that
tells Ruby to treat the content of this file as UTF-8-encoded text. If
you leave it out, Ruby 1.9 assumes your file is encoded in US-ASCII,
which causes errors as soon as you use characters not defined in ASCII.
As an alternative, you may start Ruby with the -Ku switch, which sets
the default source encoding to UTF-8, but I didn't try this. (The
capital -U switch only sets the default internal encoding.)

Read up on String#encode and String#force_encoding if you want to
convert between encodings or change the encoding tag of a string without
actually touching the data in it.
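
The difference between the two methods can be seen in a small sketch
(the variable names are only for illustration): String#encode changes
the bytes and keeps the character, while String#force_encoding keeps
the bytes and changes how they are interpreted.

```ruby
# encoding: UTF-8
utf8 = "é"  # two bytes in UTF-8: C3 A9

# encode transcodes: same character, different bytes.
latin1 = utf8.encode("ISO-8859-1")

# force_encoding only relabels: same bytes, now read as two Latin-1 characters.
relabelled = utf8.dup.force_encoding("ISO-8859-1")

puts utf8.bytes.to_a.inspect
puts latin1.bytes.to_a.inspect
puts relabelled.bytes.to_a.inspect
puts relabelled.length
```

force_encoding is the right tool when a string carries the wrong label
(for example, UTF-8 bytes tagged as binary); encode is the right tool
when the label is correct and you want a different encoding.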

Since Ruby 1.9, Ruby has quite good support for encodings other than ASCII.

Just a thought: Is there such a thing as Regexp#encode?

Vale,
Marvin
 
Quintus

On 05.03.2011 20:31, Quintus wrote:
=================================
#Encoding: UTF-8

What I forgot to mention: Some editors put an invisible BOM (Byte Order
Mark) at the beginning of UTF-8 files. That one can cause problems
because the first line is not read properly in that case. So ensure your
editor doesn't write the BOM.
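
If you are unsure whether your editor wrote a BOM, you can check the
first three bytes of the file yourself. This is only a sketch: the
helper name has_utf8_bom? and the two sample files are made up for
illustration.

```ruby
# encoding: UTF-8
require "tmpdir"

# Hypothetical helper: true if the file starts with the UTF-8 BOM (EF BB BF).
def has_utf8_bom?(path)
  bom = [0xEF, 0xBB, 0xBF].pack("C*")              # binary string
  File.open(path, "rb") { |f| f.read(3) } == bom   # compare the raw bytes
end

# Two throwaway files, one with and one without a BOM.
dir = Dir.mktmpdir
with_bom = File.join(dir, "bom.rb")
without_bom = File.join(dir, "plain.rb")
File.open(with_bom, "wb") { |f| f.write([0xEF, 0xBB, 0xBF].pack("C*") + "puts 1\n") }
File.open(without_bom, "wb") { |f| f.write("puts 1\n") }

puts has_utf8_bom?(with_bom)
puts has_utf8_bom?(without_bom)
```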

Vale,
Marvin
 
7stud --

You can try and troubleshoot the problems you are having by determining
the encoding of every string in your program.

To determine your source code's encoding, i.e. what the literal strings
you type in your program get encoded as, do this:

puts __ENCODING__


To determine a particular string's encoding, e.g. a string you read from
a file, do this:

puts the_str.encoding.name
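
Putting the two checks together, a rough troubleshooting sketch might
look like this. The helper name describe and the sample strings are
assumptions for illustration; valid_encoding? is added because a string
can carry a label that does not fit its bytes.

```ruby
# encoding: UTF-8
# Hypothetical helper: one-line summary of a string's encoding state.
def describe(label, str)
  "#{label}: #{str.encoding.name}, valid=#{str.valid_encoding?}, " \
    "bytes=#{str.bytesize}, chars=#{str.length}"
end

literal = "exagérer"                                  # tagged with the source encoding
latin1 = literal.encode("ISO-8859-1")                 # properly transcoded copy
mislabelled = latin1.dup.force_encoding("UTF-8")      # Latin-1 bytes, wrong label

puts "source: #{__ENCODING__}"
puts describe("literal", literal)
puts describe("latin1", latin1)
puts describe("mislabelled", mislabelled)
```

A string that reports valid=false is a strong hint that the file was
read with the wrong encoding.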
 
7stud --

By the way, if you read the strings from a file, it might be easier to
change the encoding of the regex to match the encoding of the strings.
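
One way to do that, sketched under the assumption that the strings are
Windows-1252, is to build the regexp from a source string in that
encoding with Regexp.new, which takes its encoding from the string it
is given. The sample verb stands in for a line read from a file.

```ruby
# encoding: UTF-8
# Stand-in for a line read from a Windows-1252 file without transcoding.
verb = "exagérer".encode("Windows-1252")

# A UTF-8 regexp literal with non-ASCII characters cannot be matched
# against this string.
literal_failed = begin
  verb =~ /érer\z/
  false
rescue Encoding::CompatibilityError
  true
end

# Build the pattern in the string's own encoding instead.
pattern = Regexp.new("érer\\z".encode("Windows-1252"))

puts literal_failed
puts pattern.encoding
puts "match" if verb =~ pattern
```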
 
Thomas Luedeke

The script is:

========================================

#! /bin/ruby -vU

#Encoding: UTF-8



verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end


========================================

The error I get is:

ruby note.rb

note.rb:9: invalid multibyte char (UTF-8)

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"
 
Quintus

On 06.03.2011 08:47, Thomas Luedeke wrote:
The script is:

========================================

#! /bin/ruby -vU

#Encoding: UTF-8



verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end


========================================

Don't leave a blank line between the shebang line and the magic comment.
The magic comment must either be the very first line, or the second one
if you have a shebang.
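
As a sketch, the layout would look like this. The env-based shebang
path is an assumption (adjust it to wherever your ruby lives), and the
condition uses plain verb, since ${verb} is shell syntax rather than
Ruby.

```ruby
#!/usr/bin/env ruby
# encoding: UTF-8
# Line 1 is the shebang, line 2 the magic comment, nothing in between.

verb = "appèler"

# Plain local variable on the left-hand side of =~.
message = "The verb was #{verb}" if verb =~ /èler/
puts message
```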

Vale,
Marvin
 
Alexey Petrushin

Add this line to your ~/.profile.

export RUBYOPT="-Ku -rrubygems"

Sadly, there's no other way to set global default source encoding in
ruby 1.9 :(
 
7stud --

Thomas Luedeke wrote in post #985708:
This seems to have worked in Notepad++, set to UTF-8 and with the BOM
off:

======================

#! /bin/ruby -Kn

#Encoding: UTF-8

verb = "appèler"
if( verb =~ /èler/) then print "The verb was #{verb}" end

======================

I think it was the -Kn flag, although I don't understand what that
changes. I'll look into it. Thanks for all your help!

In Ruby 1.8, there is a variable called $KCODE. If you set it to "UTF-8"
(or just "U"), regular expressions match characters rather than single
bytes. If you set $KCODE to "N" (the default), regular expressions
match single bytes (unless you use the /u flag on your regular
expression).

You can set $KCODE from the command line, e.g. -Ku or -Kn. In Ruby 1.9,
$KCODE itself no longer has any effect; the -K switches instead set the
default source encoding, which a magic comment in the file overrides.

 
brabuhr

The script is:

========================================
#! /bin/ruby -vU
#Encoding: UTF-8
verb = "appèler"
if ${verb} =~ /èler/ then print "The verb was #{verb}" end
========================================

note.rb:9: invalid multibyte char (UTF-8)

note.rb:9: syntax error, unexpected tIDENTIFIER, expecting $end

verb = "appΦler"

Are you absolutely certain that your file is UTF-8 encoded?

$ cat i.rb
#Encoding: UTF-8
verb = "appèler"
puts "The verb was #{verb}" if verb =~ /èler/

$ ruby -v i.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
The verb was appèler

$ enca -L none i.rb
Universal transformation format 8 bits; UTF-8

$ iconv -t LATIN1 -f UTF8 < i.rb > l.rb

$ enca -L none l.rb
Unrecognized encoding

$ ruby -v l.rb
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
l.rb:2: invalid multibyte char (UTF-8)
l.rb:2: syntax error, unexpected tIDENTIFIER, expecting $end
verb = "app?ler"
^
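
If enca and iconv are not at hand, a similar check can be sketched in
Ruby itself. The helper name utf8_file? is made up for illustration,
and the two sample files below only mimic the i.rb and l.rb from the
session above.

```ruby
# encoding: UTF-8
require "tmpdir"

# Hypothetical helper: read the raw bytes and ask whether they form valid UTF-8.
def utf8_file?(path)
  bytes = File.open(path, "rb") { |f| f.read }
  bytes.force_encoding("UTF-8").valid_encoding?
end

# Recreate the two files: one saved as UTF-8, one converted to Latin-1.
dir = Dir.mktmpdir
source = "verb = \"appèler\"\n"
utf8_path = File.join(dir, "i.rb")
latin1_path = File.join(dir, "l.rb")
File.open(utf8_path, "wb") { |f| f.write(source) }
File.open(latin1_path, "wb") { |f| f.write(source.encode("ISO-8859-1")) }

puts utf8_file?(utf8_path)
puts utf8_file?(latin1_path)
```

A pure-ASCII file also passes this check, since ASCII is a subset of
UTF-8; the check only tells you that the bytes are not something else.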
 
