Ruby 1.9.2: /\w/u does not match umlauts ("ü")

Andreas S.

I found that, unlike in Ruby 1.8, the word character class (\w) in Ruby 1.9
regexes does not match German umlauts (or any letters other than
ASCII ones). According to the Oniguruma documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), it should match
everything in the Unicode "Letter" category, which includes umlauts.

test.rb (also attached):
# encoding: utf-8
$KCODE='u'
s = "ü"
puts s.match(/\w/u).inspect

Result with Ruby 1.8:
#<MatchData "ü">

Result with Ruby 1.9.2:
nil

Is that a bug, or is there any reason behind this behavior?

Attachments:
http://www.ruby-forum.com/attachment/5113/test.rb
 
Roger Pack

Andreas said:
I found that, unlike in Ruby 1.8, the word character class (\w) in Ruby 1.9
regexes does not match German umlauts (or any letters other than
ASCII ones). According to the Oniguruma documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt), it should match
everything in the Unicode "Letter" category, which includes umlauts.

http://github.com/rdp/ruby_tutorials_core/wiki/Ruby-Talk-FAQ#unicode_not_found

So it's intended. However, if you strongly dislike this, then complain
about it, since apparently it surprises a number of people :)

-r
 
Andreas S.

http://github.com/rdp/ruby_tutorials_core/wiki/Ruby-Talk-FAQ#unicode_not_found

"Basically at a certain patch level of 1.9.1, \w was set to no longer
match unicode characters, because the core developers were concerned
that this was not what people expected from \w."

Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.

Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this ("Word"
is not a Unicode character category).
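
A minimal sketch of the difference, assuming a UTF-8 source file on Ruby 1.9.2 or later (the variable names are only illustrative):

```ruby
# encoding: utf-8
s = "ü"

# \w only covers ASCII word characters here, so this finds nothing.
ascii_word = s.match(/\w/)

# \p{Word} is the property class that also covers non-ASCII letters.
unicode_word = s.match(/\p{Word}/)

puts ascii_word.inspect     # nil
puts unicode_word.inspect   # #<MatchData "ü">
```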
 
Roger Pack

Andreas said:
Well, 1.9.2 behaving differently than 1.9.1 and 1.8 is certainly less
expected.

Yeah. 1.9.1 behaving differently at a different *patch level* is less
than expected, too.

Andreas said:
Apparently in 1.9 \p{Word} can be used instead of \w to match Unicode
characters; however, I did not find any documentation for this ("Word"
is not a Unicode character category).

It's odd that there's no standard. Maybe Ruby made this up on its
own, then?

I think it's mentioned briefly here:

http://svn.ruby-lang.org/repos/ruby/trunk/doc/re.rdoc
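
For splitting text into words, something like the following might serve as a rough sketch. It assumes UTF-8 input on Ruby 1.9.2 or later; the sample sentence and variable names are made up for illustration. POSIX bracket expressions such as [[:alpha:]] are Unicode-aware too, so they are another option when only letters are wanted:

```ruby
# encoding: utf-8
text = "Grüße aus München"

# \w stops at every non-ASCII letter, so the words get chopped up.
ascii_words = text.scan(/\w+/)

# \p{Word} keeps non-ASCII letters inside the word.
unicode_words = text.scan(/\p{Word}+/)

# [[:alpha:]] matches letters only (no digits or underscore).
alpha_words = text.scan(/[[:alpha:]]+/)

p ascii_words    # ["Gr", "e", "aus", "M", "nchen"]
p unicode_words  # ["Grüße", "aus", "München"]
p alpha_words    # ["Grüße", "aus", "München"]
```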
 
