Unicode and Character Classes -- a bug?

Richard Wiseman · Sep 19, 2006

Hi,

I've found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:

$KCODE = 'u'
require 'jcode'

text = "\xa3A\nB\n\xa3C\nxD\nE"

# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }

# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I *might* expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

The output of this is:

Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E

Without the first two (Unicode-specifying) lines, the output is what I
expect:

Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E

(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it's
the very first thing in the string.

Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I'm doing?

Thanks!

Richard

MonkeeSage · Sep 19, 2006

Hi Richard,

It appears that you were spot-on with your guess about wonky things
happening in character classes. Seemingly hex escape codes aren't
allowed there. You'll have to either use a literal character, or if
that isn't possible, do something ugly like this:

/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

Regards,
Jordan
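
For readers trying the interpolation idea on a current Ruby (1.9 or later, where $KCODE and jcode no longer apply and every string carries its own encoding), a minimal sketch follows. It assumes the intended character is ú (U+00FA), written as a UTF-8 string rather than as the raw byte \xa3; the constant names are illustrative and not from the thread.

```ruby
# Sketch only: interpolate the character into the pattern as a string
# instead of writing a byte escape inside the character class.
U_ACUTE = "\u{FA}"                               # "ú" as a UTF-8 string (assumed target character)
TEXT    = "#{U_ACUTE}A\nB\n#{U_ACUTE}C\nxD\nE"   # same five lines as the original example

ALTERNATION = /^(?:x|#{U_ACUTE})?[A-Z]$/         # the alternation form from the first post
CHAR_CLASS  = /^[x#{U_ACUTE}]?[A-Z]$/            # character class built by interpolation

ALT_MATCHES   = TEXT.scan(ALTERNATION)
CLASS_MATCHES = TEXT.scan(CHAR_CLASS)

puts ALT_MATCHES
puts CLASS_MATCHES
```

Under those assumptions both patterns should return the same five lines, so the result no longer depends on whether alternation or a character class is used.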

Richard Wiseman · Sep 19, 2006

Jordan said:
/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

I hadn't thought of that one - thanks for the suggestion! The simplest
(working) alternative I could think of was the parenthesised list of
individual characters as shown in the first half of the example code.

Daniel DeLorme · Sep 21, 2006

Richard said:
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

That is very weird indeed. It's normal that your example doesn't work, because
\xa3 is NOT valid utf8. But I would've expected it to work if you used the
correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

$KCODE='u'
=> "u"
text = "\xc3\xbaA\nB\n\xc3\xbaC\nxD\nE"
=> "úA\nB\núC\nxD\nE"
text.scan(/^[xú]?[A-Z]$/)
=> ["úA", "B", "úC", "xD", "E"]
text.scan(/^[x\xc3\xba]?[A-Z]$/)
=> ["B", "xD", "E"]

WTF? Can anyone explain this?
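
The validity point above can be checked directly on a Ruby with per-string encodings (1.9 or later, which is an assumption; the thread itself uses 1.8 with $KCODE). The sketch below only inspects the bytes. It does not reproduce the 1.8 matching behaviour, and the constant names are illustrative.

```ruby
# Sketch only: compare the lone byte 0xA3 with the two-byte sequence 0xC3 0xBA.
LONE_BYTE = [0xA3].pack("C*").force_encoding(Encoding::UTF_8)
TWO_BYTES = [0xC3, 0xBA].pack("C*").force_encoding(Encoding::UTF_8)

# 0xA3 is a UTF-8 continuation byte, so on its own it is not a valid character.
LONE_VALID = LONE_BYTE.valid_encoding?

# 0xC3 0xBA is the UTF-8 encoding of U+00FA (ú).
PAIR_VALID      = TWO_BYTES.valid_encoding?
PAIR_CODEPOINTS = TWO_BYTES.codepoints

puts "0xA3 alone valid UTF-8?  #{LONE_VALID}"
puts "0xC3 0xBA valid UTF-8?   #{PAIR_VALID}"
puts "code points: #{PAIR_CODEPOINTS.inspect}"
```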

MonkeeSage · Sep 21, 2006

Daniel said:
That is very weird indeed. It's normal that your example doesn't work, because
\xa3 is NOT valid utf8. But I would've expected it to work if you used the
correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

That shouldn't matter. He was matching the same hex escape he used in
his string (viz., \xa3). It shouldn't matter whether it's unicode or
just random data; the match should go through (or fail) in either case.

WTF? Can anyone explain this?

Not really, because I don't understand Oniguruma (the regexp engine);
I'm barely smart enough to _use_ regexps. But seemingly, you can't
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan
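
One further option, assuming a Ruby whose regexp literals accept \u code-point escapes (1.9 or later, which is not the 1.8 setup used in this thread), is to name the character by code point inside the class rather than by its bytes. A minimal sketch, with illustrative constant names:

```ruby
# Sketch only: a code-point escape makes the class member one whole character
# rather than a sequence of separate bytes.
SAMPLE            = "\u{FA}A\nB\n\u{FA}C\nxD\nE"   # "ú" written as U+00FA
BY_CODEPOINT      = /^[x\u00FA]?[A-Z]$/
CODEPOINT_MATCHES = SAMPLE.scan(BY_CODEPOINT)

puts CODEPOINT_MATCHES
```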

Verno Miller · Sep 21, 2006

Jordan said:
Daniel DeLorme wrote:

...

Not really, because I don't understand Oniguruma (the regexp engine);
I'm barely smart enough to _use_ regexps. But seemingly, you can't
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan

Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

http://bigbold.com/snippets/posts/show/1659

MonkeeSage · Sep 21, 2006

Verno said:
Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai Weibull's extension
(http://rubyforge.org/projects/char-encodings).

Regards,
Jordan

Verno Miller · Sep 21, 2006

Jordan said:
Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai Weibull's extension
(http://rubyforge.org/projects/char-encodings).

Regards,
Jordan

Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

For some info on Oniguruma btw I've run across this page:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

I've played with the u option regex hack quite a while back (seemed to
be working pretty well even with some Japanese chars if I remember
correctly), so I just thought to throw it in as a tip.

Thanks, again, for the update to Nikolai Weibull's extension!

Cheers,
Verno

MonkeeSage · Sep 21, 2006

Verno said:
Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

NP

For some info on Oniguruma btw I've run across this page:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

And thank YOU for this, Verno! An Oniguruma cheat sheet. That's sweet!!

Regards,
Jordan
