Unicode and Character Classes -- a bug?


Richard Wiseman

Hi,

I've found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:


$KCODE = 'u'
require 'jcode'

text = "\xa3A\nB\n\xa3C\nxD\nE"

# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }

# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I *might* expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }


The output of this is:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E


Without the first two (Unicode-specifying) lines, the output is what I
expect:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E


(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it's
the very first thing in the string.
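
For readers on Ruby 1.9 or later, where `$KCODE` no longer has any effect and `jcode` is gone, the "no Unicode" behaviour described above can be approximated by matching raw bytes. The following is only a sketch under that assumption, not what the 1.8 interpreter does internally; the variable names are illustrative.

```ruby
# Assumption: Ruby 1.9+; this is a byte-wise approximation of the
# "without $KCODE" run, not a reproduction of Ruby 1.8 internals.

# String#b returns a copy tagged as binary (ASCII-8BIT), so \xa3 is
# treated as a single plain byte.
text = "\xa3A\nB\n\xa3C\nxD\nE".b

# The /n modifier makes the regexp binary as well, which allows the
# \xa3 escape inside the character class.
pattern = /^[x\xa3]?[A-Z]$/n

matches = text.scan(pattern)
puts matches.size   # 5: every line matches
```

Both the string and the pattern are binary here, so no encoding rules come into play and the character class behaves like the alternation.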

Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I'm doing?

Thanks!

Richard
 

MonkeeSage

Hi Richard,

It appears that you were spot-on with your guess about wonky things
happening in character classes. Seemingly hex escape codes aren't
allowed there. You'll have to either use a literal character, or if
that isn't possible, do something ugly like this:

/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

Regards,
Jordan
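
A rough modern equivalent of the interpolation idea, assuming Ruby 1.9 or later and a UTF-8 source file, is to build the character class from ordinary strings. The names `prefix_chars` and `char_class` are made up for this sketch, and `"\u00fa"` stands in for the ú shown in the output above.

```ruby
# Assumption: Ruby 1.9+ with UTF-8 source encoding (the default since 2.0).
# prefix_chars and char_class are hypothetical names for illustration.
prefix_chars = ["x", "\u00fa"]

# Regexp.escape guards against characters that are special in a regexp
# (such as "-" or "]"); it leaves "x" and the accented letter untouched.
char_class = prefix_chars.map { |c| Regexp.escape(c) }.join

# Interpolate the prepared characters into the character class.
pattern = /^[#{char_class}]?[A-Z]$/

text = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(pattern)
puts matches.size   # 5
```

Because the class is assembled from real characters rather than hex escapes, the regexp engine sees one multibyte character instead of separate bytes.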
 

Richard Wiseman

Jordan said:
/^[x#{"\xa3"}]?[A-Z]$/ # note the interpolation

There might be another solution, hopefully so, but this should at least
work if nothing else turns up.

I hadn't thought of that one - thanks for the suggestion! The simplest
(working) alternative I could think of was the parenthesised list of
individual characters as shown in the first half of the example code.
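
The alternation approach can also be generated rather than typed by hand. This is a minimal sketch assuming Ruby 1.9 or later; `alternatives` is an illustrative name and `"\u00fa"` again stands in for the ú in the output.

```ruby
# Assumption: Ruby 1.9+ with UTF-8 source encoding.
# Regexp.union builds a pattern that matches any of the given strings,
# escaping them as needed.
alternatives = Regexp.union("x", "\u00fa")

# Interpolating a Regexp embeds it as a non-capturing group, so scan
# still returns whole matches rather than captured groups.
pattern = /^(?:#{alternatives})?[A-Z]$/

text = "\u00faA\nB\n\u00faC\nxD\nE"
matches = text.scan(pattern)
puts matches.size   # 5
```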
 

Daniel DeLorme

Richard said:
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }

That is very weird indeed. It's normal that your example doesn't work, because
\xa3 is NOT valid utf8. But I would've expected it to work if you used the
correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

$KCODE='u'
=> "u"
text = "\xc3\xbaA\nB\n\xc3\xbaC\nxD\nE"
=> "úA\nB\núC\nxD\nE"
text.scan(/^[xú]?[A-Z]$/)
=> ["úA", "B", "úC", "xD", "E"]
text.scan(/^[x\xc3\xba]?[A-Z]$/)
=> ["B", "xD", "E"]

WTF? Can anyone explain this?
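
The claim about valid and invalid sequences can be checked directly on Ruby 1.9 or later, where strings know their own encoding. This sketch only verifies the byte-level facts; it does not explain the character class behaviour under `$KCODE`. The variable names are illustrative.

```ruby
# Assumption: Ruby 1.9+ (String#valid_encoding? does not exist in 1.8).

# A lone 0xA3 byte is a UTF-8 continuation byte with no lead byte.
lone_byte = "\xa3".b.force_encoding(Encoding::UTF_8)
lone_valid = lone_byte.valid_encoding?
puts lone_valid    # false

# 0xC3 0xBA is the two-byte UTF-8 encoding of U+00FA.
two_bytes = "\xc3\xba".b.force_encoding(Encoding::UTF_8)
two_valid = two_bytes.valid_encoding?
puts two_valid     # true

# Two bytes, but a single character with code point 0xFA.
codepoint = two_bytes.unpack("U").first
puts two_bytes.length      # 1
puts two_bytes.bytesize    # 2
puts codepoint.to_s(16)    # fa
```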
 

MonkeeSage

Daniel said:
That is very weird indeed. It's normal that your example doesn't work, because
\xa3 is NOT valid utf8. But I would've expected it to work if you used the
correct utf8 sequence for "ú" ("\xc3\xba"), except it doesn't!

That shouldn't matter. He was matching the same hex escape he used in
his string (viz., \xa3). It shouldn't matter whether it's unicode or
just random data; the match should go through (or fail) in either case.

Daniel said:
WTF? Can anyone explain this?

Not really, because I don't understand Oniguruma (the regexp engine);
I'm barely smart enough to _use_ regexps. ;) But seemingly, you can't
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan
 

Verno Miller

Jordan said:
Daniel DeLorme wrote:

...


Not really, because I don't understand Oniguruma (the regexp engine);
I'm barely smart enough to _use_ regexps. ;) But seemingly, you can't
use hex escapes in character classes, so you have to use the literal or
do other things to work around it (see last two posts above).

Regards,
Jordan


Just a pointer to some examples how to parse UTF-8 encoded strings in
Ruby:

http://bigbold.com/snippets/posts/show/1659
 

Verno Miller

Jordan said:
Hi Verno,

I used to have a class that used that technique to fake UTF-8 support.
I now use Nikolai Weibull's extension
(http://rubyforge.org/projects/char-encodings).

Regards,
Jordan


Thanks for this one, Jordan! I seem to have missed some stuff on
redhanded as of late, esp.

http://redhanded.hobix.com/inspect/nikolaiSUtf8LibIsAllReady.html

For some info on Oniguruma btw I've run across this page:

http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt

I played with the u option regex hack quite a while back (it seemed to
work pretty well even with some Japanese characters, if I remember
correctly), so I just thought I'd throw it in as a tip.

Thanks, again, for the update to Nikolai Weibull's extension!

Cheers,
Verno
 
