Unicode in Regex

Greg Willits

This is mostly a Ruby thing, and partly a Rails thing.

I'm expecting a validate_format_of with a regex like this

/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/

to allow many of the normal characters like ö é å to be submitted via
web form.

However, the extended characters are being rejected.

This works just fine though (which is just a-zA-Z)

/^[\x41-\x5A\x61-\x7A\.\'\-\ ]*?$/

It also seems to fail with full \x0000 numbers. Is there a limit at \xFF?

Some plain Ruby tests seem to suggest unicode characters don't work at
all??

p 'abvHgtwHFuG'.scan(/[a-z]/)
p 'abvHgtwHFuG'.scan(/[A-Z]/)
p 'abvHgtwHFuG'.scan(/[\x41-\x5A]/)
p 'abvHgtwHFuG'.scan(/[\x61-\x7A]/)
p 'aébvHögtåwHÅFuG'.scan(/[\xC0-\xD6\xD9-\xF6\xF9-\xFF]/)

["a", "b", "v", "g", "t", "w", "u"]
["H", "H", "F", "G"]
["H", "H", "F", "G"]
["a", "b", "v", "g", "t", "w", "u"]
["\303", "\303", "\303", "\303"]

So, what's the secret to using unicode character ranges in Ruby regex
(or Rails validations)?
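(A present-day aside, not part of the original 1.8-era thread: on modern Ruby, 1.9 and later, strings carry their encoding, so the intent of the ranges above can be written directly as code points. A hedged sketch:)

```ruby
# Modern Ruby (1.9+) sketch: \uXXXX code point escapes cover the same
# Latin-1 letters the \xC0-\xFF byte ranges were aiming at.
name_re = /\A[a-zA-Z\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF.'\- ]*\z/

p "Göran"  =~ name_re  # => 0   (matches)
p "Muños"  =~ name_re  # => 0   (matches)
p "abc123" =~ name_re  # => nil (digits rejected)
```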
 
Dale Martenson

So, what's the secret to using unicode character ranges in Ruby regex
(or Rails validations)?


Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
Ruby Conference. His presentation can be found at:

http://www.tbray.org/talks/rubyconf2006.pdf

He described how many member functions have trouble dealing with these
character sets. He made special reference to regular expressions.

--Dale
 
Greg Willits

Dale said:
Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
Ruby Conference. His presentation can be found at:

http://www.tbray.org/talks/rubyconf2006.pdf

He described how many member functions have trouble dealing with these
character sets. He made special reference to regular expressions.


That's just beyond sad.

I've been using Lasso for several years now, and by *2003* it provided
complete support for Unicode. I know there's some esoterics it may not
deal with, but for all practical purposes we can round-trip data in
western and eastern languages with Lasso quite easily.

How can all these other languages be so far behind?

Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
in a web form with proper server side validations. Aargh.

-- gw
 
MonkeeSage

That's just beyond sad.

I've been using Lasso for several years now, and by *2003* it provided
complete support for Unicode. I know there's some esoterics it may not
deal with, but for all practical purposes we can round-trip data in
western and eastern languages with Lasso quite easily.

How can all these other languages be so far behind?

Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
in a web form with proper server side validations. Aargh.

-- gw

Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).
Everything in ruby is a bytestring.

irb(main):001:0> 'aébvHögtåwHÅFuG'.scan(/./)
=> ["a", "\303", "\251", "b", "v", "H", "\303", "\266", "g", "t",
"\303", "\245", "w", "H", "\303", "\205", "F", "u", "G"]

So your character class is matching the first byte of the composite
characters (which is \303 in octal), and skipping the next (since it's
below the range). You probably want something like...

reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
'aébvHögtåwHÅFuG'.scan(reg)

irb(main):006:0* reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
=> /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
irb(main):007:0> 'aébvHögtåwHÅFuG'.scan(reg)
=> ["\303\251", "\303\266", "\303\245", "\303\205"]
irb(main):008:0> "å" == "\303\245"
=> true
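(Inspecting the raw bytes makes the byte-pair matching above concrete; a small sketch using modern Ruby's String#bytes:)

```ruby
# "é" is two bytes in UTF-8: 0xC3 0xA9, i.e. \303 \251 in octal,
# which is why a single-byte character class only catches the lead byte.
s = "aébvHögtåwHÅFuG"
p s.bytes.first(3)  # => [97, 195, 169]  ('a', then the two bytes of 'é')
p 195.to_s(8)       # => "303"
```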

Ps. I'm not entirely sure the value of the second character class is
right.

Regards,
Jordan
 
Jimmy Kofler

Unicode in Regex
Posted by Greg Willits (-gw-) on 30.11.2007 21:18
This is mostly a Ruby thing, and partly a Rails thing.

I'm expecting a validate_format_of with a regex like this

/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/

to allow many of the normal characters like ö é å to be submitted via
web form.


How about the utf8 validation regex here:
http://snippets.dzone.com/posts/show/4527 ?
 
Greg Willits

Greg said:
I'm expecting a validate_format_of with a regex like this
/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
to allow many of the normal characters like ö é å to be submitted via
web form. However, the extended characters are being rejected.


So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the
following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x
characters in a regex.

So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range
with \x

Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed
individually.

utf_accents = '\xC0\xC1\xC2\.......'

Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.
 
Daniel DeLorme

MonkeeSage said:
Ruby 1.8 doesn't have unicode support (1.9 is starting to get it).

It enrages me to see this kind of FUD. Through regular expressions, ruby
1.8 has 80-90% complete utf8 support. And oniguruma makes utf8 support
well-nigh 100% complete.

'aébvHögtåwHÅFuG'.scan(/./u)
=> ["a", "é", "b", "v", "H", "ö", "g", "t", "å", "w", "H", "Å", "F",
"u", "G"]
'aébvHögtåwHÅFuG'.scan(/[éöåÅ]/u)
=> ["é", "ö", "å", "Å"]

Ok, sometimes you have to take a weird approach because of the missing
10-20%, but it's still workable.
Everything in ruby is a bytestring.

YES! And that's exactly how it should be. Who is it that spread the
flawed idea that strings are fundamentally made of characters? I'd like
to slap him around a little. Fundamentally, ever since the word "string"
was applied to computing, strings were made of 8-BIT CHARS, not n-bit
characters. If only the creators of C had called that datatype "byte"
instead of "char" it would have saved us so many misunderstandings.

Usually the complaint about the lack of unicode support is that
something like "日本語".length returns 9 instead of 3, or that
"日本語".index("語") returns 6 instead of 2. It's nice that people want to
completely redefine the API to return character positions and all that,
but please don't complain that it's broken just because you happen to be
using it incorrectly. Use the right tool for the job. SQL for database
queries, non-home-brewed crypto libraries for security, regular
expressions for string manipulation.
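(For comparison, this character-based API is in fact what Ruby 1.9 and later settled on; a quick sketch:)

```ruby
# Modern Ruby (1.9+): byte and character views are both explicit.
s = "日本語"
p s.length      # => 3 (characters)
p s.bytesize    # => 9 (bytes, in UTF-8)
p s.index("語") # => 2 (character position)
```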

I'm terribly sorry for the rant but I had to get it off my chest.

Dan
 
Daniel DeLorme

Greg said:
Greg said:
I'm expecting a validate_format_of with a regex like this
/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
to allow many of the normal characters like ö é å to be submitted via
web form. However, the extended characters are being rejected.


So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the
following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x
characters in a regex.

So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range
with \x

Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous
angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.

So ranges *do* work in utf8 but you have to be careful:
"àâäçèéêîïôü".scan(/[ä-î]/u) => ["ä", "ç", "è", "é", "ê", "î"]
"àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
"àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Dan
 
MonkeeSage

Greg said:
I'm expecting a validate_format_of with a regex like this
/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
to allow many of the normal characters like ö é å to be submitted via
web form. However, the extended characters are being rejected.

So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the
following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.

The only thing not working now is the ability to define a range of \x
characters in a regex.

So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
to have an ä in it. Perfect.

But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u

But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range
with \x

Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

For now, I just have all the characters I want included < \xFF listed
individually.

utf_accents = '\xC0\xC1\xC2\.......'

Is_person_name = /^[a-zA-Z#{utf_accents}\.\'\-\ ]*?$/u

But I'd like to solve the range notation if I can.

--
def gw
acts_as_n00b
writes_at(www.railsdev.ws)
end

This seems to work...

$KCODE = "UTF8"
p /^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?/u =~ "Jäsp...it works"
# => 0

However, it looks to me like it would be more robust to use a slightly
modified version of UTF8REGEX (found in the link Jimmy posted
above)...

UTF8REGEX = /\A(?:
[a-zA-Z\.\-\'\ ]
| [\xC2-\xDF][\x80-\xBF]
| \xE0[\xA0-\xBF][\x80-\xBF]
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
| \xED[\x80-\x9F][\x80-\xBF]
| \xF0[\x90-\xBF][\x80-\xBF]{2}
| [\xF1-\xF3][\x80-\xBF]{3}
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/mnx

p UTF8REGEX =~ "Jäsp...it works here too"
# => 0

Look at the link to see the explanation of the alternations.
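(On modern Ruby this byte-structure validation is built in, so a hand-rolled UTF8REGEX is no longer needed; a sketch:)

```ruby
# String#valid_encoding? performs the same well-formedness check that
# UTF8REGEX encodes by hand: lead bytes must be followed by the right
# number of continuation bytes.
p "Jäsp".valid_encoding?      # => true
p "\xC3\x28".valid_encoding?  # => false (0xC3 needs a continuation byte)
```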

Regards,
Jordan
 
Daniel DeLorme

Greg said:
So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
to have an ä in it. Perfect.

If that actually works, it means you are really using ISO-8859-1
strings, not UTF-8.
utf_accents = '\xC0\xC1\xC2\.......'

Nope, that's not UTF-8. UTF-8 characters ÀÃÂ would look like
utf_accents = "\xC3\x80\xC3\x81\xC3\x82..."

Dan
 
Greg Willits

Daniel said:
Greg Willits wrote:
But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u

I've boiled the experiments down to realizing I can't define a range
with \x
Let me try to explain that in order to redeem myself from my previous
angry post.
:)

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
character code point -- which with your explanation I can finally tie
together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex I'd
never seen :p

And I realize now too I wasn't picking up on the use of octal vs
decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?

What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.

So ranges *do* work in utf8 but you have to be careful:
"àâäçèéêîïôü".scan(/[ä-î]/u) => ["ä", "ç", "è", "é", "ê", "î"]
"àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
"àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Yes!

-- gw
 
Greg Willits

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]


OK, one thing I'm still confused about -- when I look up é in any table,
its DEC is 233, which converted to OCT is 351, yet you're using 251 (and
indeed it seems like reducing the OCTs I come up with by 100 is what
actually works).

Where is this 100 difference coming from?

-- gw
 
Phrogz

'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
=> []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

OK, one thing I'm still confused about -- when I look up é in any table,
its DEC is 233, which converted to OCT is 351, yet you're using 251 (and
indeed it seems like reducing the OCTs I come up with by 100 is what
actually works).

Where is this 100 difference coming from?

http://www.fileformat.info/info/unicode/char/00e9/index.htm

The code point (and UTF-16 value) is 233 decimal, but the UTF-8 encoding
is 0xC3 0xA9, which is 195 169 in decimal, or 0303 0251 in octal.
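(The arithmetic can be checked directly on modern Ruby; a small sketch:)

```ruby
# é is code point U+00E9 (233 decimal); UTF-8 encodes it as the byte
# pair 0xC3 0xA9 = 195 169 decimal = \303 \251 octal.
p "é".ord.to_s(16)                 # => "e9"
p "é".bytes                        # => [195, 169]
p "é".bytes.map { |b| b.to_s(8) }  # => ["303", "251"]
```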
 
Phrogz

Daniel said:
Greg said:
But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
I've boiled the experiments down to realizing I can't define a range
with \x
Let me try to explain that in order to redeem myself from my previous
angry post.
:)

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
character code point -- which with your explanation I can finally tie
together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex I'd
never seen :p

And I realize now too I wasn't picking up on the use of octal vs
decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?


What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.
So ranges *do* work in utf8 but you have to be careful:
"àâäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
"àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
"àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
Hope this helps.

Yes!

-- gw
 
Charles Oliver Nutter

Daniel said:
Usually the complaint about the lack of unicode support is that
something like "日本語".length returns 9 instead of 3, or that
"日本語".index("語") returns 6 instead of 2. It's nice that people want to
completely redefine the API to return character positions and all that,
but please don't complain that it's broken just because you happen to be
using it incorrectly. Use the right tool for the job. SQL for database
queries, non-home-brewed crypto libraries for security, regular
expressions for string manipulation.

I'm terribly sorry for the rant but I had to get it off my chest.

Regular expressions for all character work would be a *terribly* slow
way to get things done. If you want to get the nth character, should you
do a match for n-1 characters and a group to grab the nth? Or would it
be better if you could just index into the string and have it do the
right thing? How about if you want to iterate over all characters in a
string? Should the iterating code have to know about the encoding?
Should you use a regex to peel off one character at a time? Absurd.

Regex for string access goes a long way, but it's just about the heaviest
way to do it. Strings should be aware of their encoding and should be
able to provide you access to characters as easily as bytes. That's what
1.9 (and upcoming changes in JRuby) fixes.

- Charlie
 
MonkeeSage

Greg said:
Greg Willits wrote:
I'm expecting a validate_format_of with a regex like this
/^[a-zA-Z\xC0-\xD6\xD9-\xF6\xF9-\xFF\.\'\-\ ]*?$/
to allow many of the normal characters like ö é å to be submitted via
web form. However, the extended characters are being rejected.
So, I've been pounding the web for info on UTF8 in Ruby and Rails the
past couple days to concoct some validations that allow UTF8
characters. I have discovered that I can get a little further by doing
the
following:
- declaring $KCODE = 'UTF8'
- adding /u to regex expressions.
The only thing not working now is the ability to define a range of \x
characters in a regex.
So, this /^[a-zA-Z\xE4]*?&/u will validate that a string is allowed
to have an ä in it. Perfect.
But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
I've boiled the experiments down to realizing I can't define a range
with \x
Is this just one of those things that just doesn't work yet WRT Ruby/
Rails/UTF8, or is there another syntax? I've scoured all the regex
docs I can find, and they seem to indicate a range should work.

Let me try to explain that in order to redeem myself from my previous
angry post.

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.

So ranges *do* work in utf8 but you have to be careful:
"àâäçèéêîïôü".scan(/[ä-î]/u) => ["ä", "ç", "è", "é", "ê", "î"]
"àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
"àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]

Hope this helps.

Dan

I missed your ranting.

Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
a standard mapping of bytes to *characters*. That's what unicode is.
I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
unicode when it knows nothing about that standard mapping and treats
everything as individual bytes (and any byte with a value greater than
126 just prints an octal escape); and please don't accuse others of
spreading FUD when they state the facts.

Secondly, as I said in my first post to this thread, the characters
trying to be matched are composite characters, which requires you to
match both bytes. You can try using a unicode regexp, but then you
run into the problem you mention--the regexp engine expects the pre-
composed, one-byte form...

"ò".scan(/[\303\262]/u) # => []
"ò".scan(/[\xf2]/u) # => ["\303\262"]

...which is why I said it's more robust to use something like the
regexp that Jimmy linked to and I reposted, instead of a unicode
regexp.

Regards,
Jordan
 
MonkeeSage

Daniel said:
Greg said:
But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
I've boiled the experiments down to realizing I can't define a range
with \x
Let me try to explain that in order to redeem myself from my previous
angry post.
:)

Basically, \xE4 is counted as the byte value 0xE4, not the unicode
character U+00E4. And in a range expression, each escaped value is taken
as one character within the range. Which results in not-immediately
obvious situations:
'aébvHögtåwHÅFuG'.scan(/[\303\251]/u) => []
'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
=> ["é"]

OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
character code point -- which with your explanation I can finally tie
together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex I'd
never seen :p

And I realize now too I wasn't picking up on the use of octal vs
decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?

Oniguruma is not in ruby 1.8 (though you can install it as a gem). It
is in 1.9.
What is happening in the first case is that the string does not contain
characters \303 or \251 because those are invalid utf8 sequences. But
when the value "\303\251" is *inlined* into the regex, that is
recognized as the utf8 character "é" and a match is found.
So ranges *do* work in utf8 but you have to be careful:
"àâäçèéêîïôü".scan(/[ä-î]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
"àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
=> ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
"\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
"\264", "\303", "\274"]
"àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
=> ["ä", "ç", "è", "é", "ê", "î"]
Hope this helps.

Yes!

-- gw
 
Greg Willits

Jordan said:
Oniguruma is not in ruby 1.8 (though you can install it as a gem). It
is in 1.9.

Oh. I always thought Oniguruma was the engine in Ruby.

Anyway -- everyone, thanks for all the input. I believe I'm headed in
the right direction now, and have a better hands on understanding of
UTF-8.

-- gw
 
Daniel DeLorme

Charles said:
Regular expressions for all character work would be a *terribly* slow
way to get things done. If you want to get the nth character, should you
do a match for n-1 characters and a group to grab the nth? Or would it
be better if you could just index into the string and have it do the

Ok, I'm not very familiar with the internal working of strings in 1.9,
but it seems to me that for character sets with variable byte size, it
is logically *impossible* to directly index into the string. Unless
there's some trick I'm unaware of, you *have* to count from the
beginning of the string for utf8 strings.
right thing? How about if you want to iterate over all characters in a
string? Should the iterating code have to know about the encoding?
Should you use a regex to peel off one character at a time?

That is certainly one possible way of doing things...
string.scan(/./){ |char| do_someting_with(char) }
Regex for string access goes a long way, but it's just about the heaviest
way to do it.

Heavy compared to what? Once compiled, regex are orders of magnitude
faster than jumping in and out of ruby interpreted code.
Strings should be aware of their encoding and should be
able to provide you access to characters as easily as bytes. That's what
1.9 (and upcoming changes in JRuby) fixes.

Overall I agree that the encoding stuff in 1.9 is very nice.
Encapsulating the encoding with the string is very OO. Very intuitive.
No need to think about encoding anymore, now it "just works" for
encoding-ignorant programmers (at least until the abstraction leaks). It
allows to shut up one frequent complaint about ruby; a clear political
victory. Overall it is more robust and less error-prone than the 1.8 way.

But my point was that there *is* a 1.8 way. The thing that riled me up
and that I was responding to was the claim that 1.8 did not have unicode
support AT ALL. Unequivocally, it does, and it works pretty well for me.
IMHO there is a certain minimalist elegance in considering strings as
encoding-agnostic and using regex to get encoding-specific views. I
could do str[/./n] and str[/./u]; I can't do that anymore.

1.9 makes encodings easier for the english-speaking masses not used to
extended characters, but let's remember that ruby *always* had support
for multibyte character sets; after all it *did* originate from a
country with two gazillion "characters".

Daniel
 
Daniel DeLorme

MonkeeSage said:
Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
a standard mapping of bytes to *characters*. That's what unicode is.
I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
unicode when it knows nothing about that standard mapping and treats
everything as individual bytes (and any byte with a value greater than
126 just prints an octal escape)

Ok, then how do you explain this:

'abcä'.scan(/./u)
=> ["a", "b", "c", "ä"]

This doesn't require any libraries, and it seems to my eyes that ruby is
converting 5 bytes into 4 characters. It shows an awareness of utf8. If
that's not *some* kind of unicode support then please tell me what it
is. It seems we're disagreeing on some basic definition of what "unicode
support" means.
Secondly, as I said in my first post to this thread, the characters
trying to be matched are composite characters, which requires you to
match both bytes. You can try to using a unicode regexp, but then you
run into the problem you mention--the regexp engine expects the pre-
composed, one-byte form...

"ò".scan(/[\303\262]/u) # => []
"ò".scan(/[\xf2]/u) # => ["\303\262"]

Wow, I never knew that second one could work. Unicode support is
actually better than I thought! You learn something new every day.
...which is why I said it's more robust to use something like the
regexp that Jimmy linked to and I reposted, instead of a unicode
regexp.

I'm not sure what makes that huge regexp more robust than a simple
unicode regexp.

Daniel
 
