Richard Wiseman
Hi,
I've found some strange and unexpected behaviour to do with pattern
matching when I use Unicode. My example code follows and contains
comments to suggest what I think should happen:
$KCODE = 'u'
require 'jcode'
text = "\xa3A\nB\n\xa3C\nxD\nE"
# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }
# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition. I'd expect it to match all lines or, if I were
# really paranoid about Unicode, I *might* expect it to match all but the
# lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }
The output of this is:
Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E
Without the first two (Unicode-specifying) lines, the output is what I
expect:
Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E
(Notice the extra line in the second half.) The thing I think is
bizarre is that if Unicode is being used, the ú matches ONLY where it's
the very first thing in the string.
Is there something funny about Unicode characters when using character
classes? Is this a known issue, or is it something weird and/or
ignorant that I'm doing?
Thanks!
Richard
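For anyone trying to reproduce this on a current Ruby, here is a minimal sketch. It rests on two assumptions that are not in the post above: that it runs on Ruby 1.9 or later, where jcode is gone and setting $KCODE no longer has any effect, and that the \xa3 byte was meant as the character U+00A3. A lone 0xA3 byte is not a complete UTF-8 sequence (U+00A3 is encoded as the two bytes 0xC2 0xA3), which may be related to the mismatch, although the sketch does not show what the old regexp engine did with it. The constant names are made up for illustration.

```ruby
# Assumption: Ruby 1.9 or later. $KCODE and jcode are deliberately not used.

# The bytes from the post, tagged as UTF-8. A lone 0xA3 is a UTF-8
# continuation byte, so this string is not well-formed UTF-8.
RAW = "\xa3A\nB\n\xa3C\nxD\nE".dup.force_encoding(Encoding::UTF_8)
puts RAW.valid_encoding?   # prints "false"

# Assumption: the intended character is U+00A3, written here with \u so
# that the string holds the full two-byte sequence 0xC2 0xA3.
TEXT = "\u00a3A\nB\n\u00a3C\nxD\nE"
puts TEXT.valid_encoding?  # prints "true"

# The two patterns from the post, with \u00a3 in place of \xa3.
ALT_MATCHES   = TEXT.scan(/^(?:x|\u00a3)?[A-Z]$/)
CLASS_MATCHES = TEXT.scan(/^[x\u00a3]?[A-Z]$/)

# Compare what the alternation and the character class each matched.
puts ALT_MATCHES == CLASS_MATCHES
puts CLASS_MATCHES.length
```

With a well-formed character in both the text and the pattern, the alternation and the character class can be compared directly, which separates the question of the character class from the question of the stray byte.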