problem with \s in unicoded regular expressions

Sergei Olonichev

Hello,

I have found a problem with the character class definitions in Unicode
regular expressions. It seems \s isn't defined properly.

See the following simple program, which ought to change whitespace
characters into line feeds:
cat test.utf8 | ruby-1.8.0 -Ku -ne '$_.gsub(/[\s]+/u,"\n"); puts $_;'

test.utf8 contains the following in hex:
C2 A0 32 33 20 31 0A C2 A0 32 34 20 31 0A

which is the UTF-8 encoding of:
00A0 NS no-break space
0032 2 digit two
0033 3 digit three
0020 SP space
0031 1 digit one
000A LF line feed (lf)
00A0 NS no-break space
0032 2 digit two
0034 4 digit four
0020 SP space
0031 1 digit one
000A LF line feed (lf)
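
The byte-to-code-point mapping listed above can be checked with a short sketch. It assumes a current Ruby (1.9 or later), not the ruby-1.8.0 binary used in the post, and the variable names are only illustrative:

```ruby
# Hex dump of test.utf8 as given in the post.
hex_bytes = %w(C2 A0 32 33 20 31 0A C2 A0 32 34 20 31 0A)

# Turn each two-digit hex string into a byte and pack the bytes into a string.
data = hex_bytes.map { |h| h.hex }.pack('C*')

# 'U*' decodes the bytes as UTF-8 and returns the code points as integers.
code_points = data.unpack('U*')

# Format the code points as four-digit hex, matching the list in the post.
labels = code_points.map { |cp| format('%04X', cp) }
puts labels.join(' ')
```

The two bytes C2 A0 decode to the single code point 00A0, which is why 14 bytes yield 12 characters.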

But Ruby does not make any changes (does not change "no-break space"
into "line feed")!
Is that a bug?


Best wishes,
Sergei
 
Simon Strandgaard

Sergei Olonichev said:
Is that a bug?

No..


server> ruby u.rb
"+)!! !\036+)!! !\036"
"+)!!\n!\036+)!!\n!\036"
server> cat u.rb
input = %w(C2 A0 32 33 20 31 0A C2 A0 32 34 20 31 0A)
str = input.map{|i| i.unpack('H2')[0].to_i.chr}.join
p str
p str.gsub(/[\s]+/u,"\n")
server>

I see no problems with regexp \s..
 
Simon Strandgaard

Simon Strandgaard said:
I see no problems with regexp \s..

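Hmm, there is something wrong with my code. Sorry, I was too quick.
My hex-to-UTF-8 conversion is buggy: unpack('H2') returns the hex code of
the first character of each string ("C" becomes "43"), not the byte the
string spells out. Does anyone know a smarter way to do this?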
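
One possible way to do the conversion, offered as a sketch rather than a definitive answer: join the hex digits and let Array#pack with the 'H*' directive build the raw bytes. The force_encoding call assumes Ruby 1.9 or later; it does not exist in 1.8.

```ruby
# Hex dump from the original post.
input = %w(C2 A0 32 33 20 31 0A C2 A0 32 34 20 31 0A)

# 'H*' reads the joined hex digits (high nibble first) and emits one byte
# per pair of digits.
str = [input.join].pack('H*')

# pack returns a binary string; tag it as UTF-8 so that regular expressions
# treat C2 A0 as one character (assumption: Ruby 1.9+).
str.force_encoding('UTF-8')

# Print the bytes back as hex to confirm the round trip.
puts str.bytes.map { |b| format('%02X', b) }.join(' ')
```

An equivalent alternative is `input.map { |h| h.hex }.pack('C*')`, which converts each hex string to an integer first.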
 
Sergei Olonichev

Sergei said:
cat test.utf8 | ruby-1.8.0 -Ku -ne '$_.gsub(/[\s]+/u,"\n"); puts $_;'

I forgot the "!" after "gsub" in this message, but not when I tested
the problem, so adding "!" does not help.
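
For comparison, here is how the same substitution behaves in current Ruby versions (1.9 and later). This is a sketch under that assumption and has not been checked against ruby-1.8.0 with -Ku. In current Ruby, \s covers only the ASCII whitespace characters, while the POSIX bracket expression [[:space:]] also matches Unicode whitespace such as U+00A0 in a UTF-8 string.

```ruby
# First line of test.utf8: no-break space, "23", space, "1", line feed.
line = "\u00A023 1\n"

# \s is ASCII-only, so the no-break space is left untouched.
ascii_only = line.gsub(/\s+/, "\n")

# [[:space:]] is Unicode-aware for UTF-8 strings and matches U+00A0 as well.
unicode_aware = line.gsub(/[[:space:]]+/, "\n")

p ascii_only
p unicode_aware
```

Under these assumptions the behaviour described in the thread is consistent with \s being defined as ASCII whitespace only, and [[:space:]] is the pattern to reach for when no-break spaces should be replaced too.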
 
