How to match accented letters on Windows

gabriele renzi

Hi gurus and nubys,

I just noticed that accented letters like èàéòùì (if anyone can actually
see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.

I haven't tried on *nix with a proper locale set, but I wonder whether
there is something special I should do to allow this kind of
special letter to be matched as a letter.

(running pragprog ruby 1.8.1 on win xp)
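
For anyone reading this on a newer Ruby: a minimal sketch, assuming Ruby 2.x or later with UTF-8 strings (not the 1.8.1 setup above), of how the same check behaves there. As far as I know, /[a-z]/ and \w stay ASCII-only, while the POSIX bracket [[:alpha:]] and the Unicode property \p{L} do match accented letters.

```ruby
# Assumes Ruby 2.x or later with UTF-8 source and strings
# (not the Ruby 1.8.1 mentioned in the post).
letters = "èàéòùì"

# Plain ASCII ranges and \w only cover ASCII letters, so nothing matches.
ascii_range = letters =~ /[a-z]/        # nil
word_class  = letters =~ /\w/           # nil

# Unicode-aware classes match the first accented letter, at index 0.
posix_alpha = letters =~ /[[:alpha:]]/  # 0
unicode_l   = letters =~ /\p{L}/        # 0

# Count how many characters are letters according to \p{L}.
letter_count = letters.scan(/\p{L}/).length
```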
 
Stephan Kämper

Hi Gabriele,

gabriele said:
Hi gurus and nubys,

I just noticed that accented letters like èàéòùì (if anyone can actually
see them correctly in this message)
are not matched by /[a-z]/ or \w on Windows.
irb(main):001:0> a = "Wrongly áccèntêd"
=> "Wrongly \240cc\212nt\210d"
irb(main):002:0> a =~ /é/
=> nil
irb(main):003:0> a =~ /è/
=> 11
irb(main):004:0> a =~ /ê/
=> 14
irb(main):005:0>

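They are matched, but apparently they're not part of [a-z].
My guess is that [a-z] is defined as a range of character codes (ASCII or
whatever the encoding is), and the accented letters fall outside it.
Anyway, a range between two accented letters works: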

irb(main):002:0> a =~ /[é-ê]/
=> 14
irb(main):003:0>

(Running Ruby 1.8.0 on WinXP)
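
A small sketch of the "range of character codes" idea, assuming Ruby 2.x or later with UTF-8 strings rather than the 1.8.0 session above. The numbers here are Unicode code points, not the code-page bytes irb printed, but the principle is the same: a bracket range is just an interval of codes.

```ruby
# Assumes Ruby 2.x or later and UTF-8 strings; the values below are
# Unicode code points, not the code-page bytes shown in the irb session.
a = "Wrongly áccèntêd"

# [a-z] is the code point range 97..122; accented letters lie above it.
az_range = ("a".ord.."z".ord)
e_grave_in_az = az_range.cover?("è".ord)  # "è" is 232, so this is false

# A range between two accented letters is also just a code point range.
# é is 233 and ê is 234, so è (232) and á (225) are skipped.
first_hit = a =~ /[é-ê]/                  # character index of "ê"
```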

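BTW, you could as well write "yöûr stríng gòés héré" =~ /[\xE0-\xEF]/
(that is 224-239 in decimal; Ruby reads \224 as octal, so hex is safer).
Using that you'll get at least some of the accented vowels (in Latin-1 the
range runs from à to ï).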
Note that there are a whole lot of accented letters: Check your
character table, somewhere in Start|Programs|Utilities|System...
I translated from my German menu entries, so YMMV a bit.

gabriele said:
I haven't tried on *nix with a proper locale set, but I wonder whether
there is something special I should do to allow this kind of
special letter to be matched as a letter.

(running pragprog ruby 1.8.1 on win xp)

Now, that leads me to a question: _Should_ accented letters be matched
by [a-z]? Personally, I'm not sure they should be...

Happy rubying

Stephan
 
Peter Hickman

Stephan said:
Now, that leads me to a question: _Should_ accented letters be matched
by [a-z]? Personally, I'm not sure they should be...

Well, if your locale were French, you would expect [a-z] to match the
accented characters used in French, but it should ignore the
Icelandic thorn or the Dutch y-umlaut thingy.

So the answer is "it depends", and what it depends on is the locale.
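
To make the "depends on locale" idea concrete, here is a rough sketch assuming Ruby 2.x or later with UTF-8 strings. The LETTERS table and the letter? helper are hypothetical names made up for illustration, and the letter sets are deliberately incomplete; nothing like this is built into Ruby as far as I know.

```ruby
# Hypothetical per-language letter classes. The sets are illustrative
# subsets, not complete or authoritative alphabets.
LETTERS = {
  "fr" => /\A[a-zàâçéèêëîïôùûü]\z/,
  "is" => /\A[a-záðéíóúýþæö]\z/
}.freeze

# Returns true when the single character counts as a letter for the
# given language code (raises KeyError for an unknown code).
def letter?(char, lang)
  LETTERS.fetch(lang).match(char) ? true : false
end

french_e_grave  = letter?("è", "fr")  # accented letter used in French
french_thorn    = letter?("þ", "fr")  # thorn is not in the French set
icelandic_thorn = letter?("þ", "is")  # but it is in the Icelandic set
```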
 
Mark Hubbart

Stephan said:
Now, that leads me to a question: _Should_ accented letters be
matched by [a-z]? Personally, I'm not sure they should be...

Peter said:
Well, if your locale were French, you would expect [a-z] to match the
accented characters used in French, but it should ignore the
Icelandic thorn or the Dutch y-umlaut thingy.

So the answer is "it depends", and what it depends on is the locale.

I wouldn't want it set based on a pre-set locale... I don't think that
would be dynamic enough. What if you need to match characters from more
than one language?

Maybe this should be handled by character classes? What if we could
modify a character class with a country/language code? Something like:

/[[:alpha-es:]]*/.match("mañana").to_s #=> "mañana"

or to match valid characters in any language:

str = "mañana n'êtes"
/[[:alpha-all:] ']*/.match(str).to_s #=> "mañana n'êtes"

This would probably have trouble if the text wasn't Unicode, though.
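
The [[:alpha-es:]] and [[:alpha-all:]] classes above are hypothetical. As a rough approximation, assuming Ruby 2.x or later with UTF-8 strings, the built-in [[:alpha:]] bracket and the \p{Latin} script property get close to the "any language" case, though neither is tied to a particular language:

```ruby
# Assumes Ruby 2.x or later with UTF-8 strings. [[:alpha-es:]] and
# [[:alpha-all:]] from the post do not exist; these are built-in stand-ins.
spanish = /[[:alpha:]]*/.match("mañana").to_s

# The apostrophe and the space are not letters, so they are listed explicitly.
str = "mañana n'êtes"
mixed = /[[:alpha:] ']*/.match(str).to_s

# \p{Latin} restricts the match to the Latin script, whatever the language,
# so the Greek word is left out.
latin_only = "mañana λόγος".scan(/\p{Latin}+/)
```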

--Mark
 
Simon Strandgaard

Mark said:
I wouldn't want it set based on a pre-set locale... I don't think that
would be dynamic enough. What if you need to match characters from more
than one language?

Maybe this should be handled by character classes? What if we could
modify a character class with a country/language code? Something like:

Have a look at this document for more info about i18n/m17n in regexps:
http://www.unicode.org/unicode/reports/tr18/

However, neither GNU regex nor Oniguruma supports it fully.
 
