Puzzling regex behaviour

Rob Biedenharn

Why don't you just find out which characters are in the [:alnum:] and
\w sets?

$ LANG=nl_NL irb
alnums = (0..0377).select {|c| c.chr =~ /[[:alnum:]]/ }.map {|c| c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
\252\265\272\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316
\317\320\321\322\323\324\325\326\330\331\332\333\334\335\336\337\340
\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361
\362\363\364\365\366\370\371\372\373\374\375\376\377"
dubyas = (0..0377).select {|c| c.chr =~ /\w/ }.map {|c|c.chr}.join
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
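
For anyone repeating this probe on Ruby 1.9 or later: there the regex engine classifies characters by the string's encoding rather than by the C locale, so the one-liner above needs an explicit encoding before it can say anything about bytes above 0177. A minimal sketch, assuming ISO-8859-1 is the encoding of interest; the helper name class_members is made up for illustration and is not from the thread:

```ruby
# Hypothetical helper (not from the thread): list the single-byte characters
# that match +pattern+ when each byte value is tagged with +encoding+.
def class_members(pattern, encoding)
  (0..0377).map { |c| c.chr(encoding) }.select { |ch| ch =~ pattern }
end

# Untagged (binary) bytes versus bytes explicitly tagged as ISO-8859-1.
binary_alnums = class_members(/[[:alnum:]]/, Encoding::BINARY)
latin1_alnums = class_members(/[[:alnum:]]/, Encoding::ISO_8859_1)
dubyas        = class_members(/\w/, Encoding::BINARY)

puts binary_alnums.join
puts dubyas.join
puts "ISO-8859-1 [[:alnum:]] members: #{latin1_alnums.size}"
```

Comparing the size of the two [[:alnum:]] lists shows whether the encoding tag, rather than LANG, changes the class on your interpreter.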

Yes, but all this really does is indicate that the irb behaviour is
the correct one.

When I run this in a stand-alone script, I get this:

$ LANG=nl_NL ./foo
"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

It's almost as if the locale isn't being propagated to the process via
the environment. But...

$ LANG=nl_NL ruby -e "puts ENV['LANG']"
nl_NL

...it _is_ being propagated.
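
The same check can be scripted. Keep in mind that seeing LANG in ENV only shows that the variable reached the process: per setlocale(3), a C program starts in the "C" locale until it calls setlocale() itself, so the variable alone need not change what isalnum() reports. A small sketch; lang_seen_by_child is a name invented here:

```ruby
require 'rbconfig'

# Hypothetical helper: start a child Ruby with LANG set to +lang+ and
# return the value the child sees in its own environment.
def lang_seen_by_child(lang)
  IO.popen({ 'LANG' => lang }, [RbConfig.ruby, '-e', "print ENV['LANG']"], &:read)
end

puts lang_seen_by_child('nl_NL')
```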

Is it the same for you?

Ian
--
Ian Macdonald | When a man is tired of London, he is tired
(e-mail address removed) | of life. -- Samuel Johnson
http://www.caliban.org/ |

Yes, the LANG is affecting the result in irb, but not ruby.

$ irb -v
irb 0.9.5(05/04/13)

Whether the irb behavior is "correct" or anomalous is probably a
question for the maintainers to debate. The man page for ctype(3)
(on my Mac OS X 10.4.8) indicates that the macros are supposed to be
based on the locale and my copy of the pickaxe (p.71) says that the
character classes are based on the ctype macros of the same name.
However, a quick C program shows effectively the same behavior as
ruby (i.e., only the [0-9A-Za-z] satisfy isalnum() even for nl_NL).
I'm now more curious as to how irb is finding the character classes.

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
 
Ian Macdonald

It turns out that the poster who mentioned possible interference from
the readline(3) library was right.

Look at this:

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil

$ irb --noreadline
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is _very_ unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Interestingly, adding "require 'readline'" to the stand-alone script
does _not_ introduce this behaviour, so it must be something to do with
the initialisation that irb does.

Ian
--
Ian Macdonald | I like your game but we have to change the
(e-mail address removed) | rules.
http://www.caliban.org/ |
 
Robert Klemme

It turns out that the poster who mentioned possible interference from
the readline(3) library was right.

That was me. :)

Look at this:

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil

$ irb --noreadline
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is _very_ unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Yeah, seems so. Unless it's documented behavior. :)

Interestingly, adding "require 'readline'" to the stand-alone script
does _not_ introduce this behaviour, so it must be something to do with
the initialisation that irb does.

It's really strange as both print the same output. How about doing this
- just to be sure that both strings contain the same sequence of bytes:

require 'enumerator'
foo.to_enum(:each_byte).to_a.join(", ")

Kind regards

robert
 
Ian Macdonald

Look at this:

$ irb
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> nil

$ irb --noreadline
irb(main):001:0> foo = "préférées"
=> "pr\351f\351r\351es"
irb(main):002:0> foo =~ /[^[:alnum:]]/
=> 2

This is _very_ unexpected and undesirable behaviour and, as such,
probably qualifies as a bug.

Yeah, seems so. Unless it's documented behavior. :)

Interestingly, adding "require 'readline'" to the stand-alone script
does _not_ introduce this behaviour, so it must be something to do with
the initialisation that irb does.

It's really strange as both print the same output.

You mean that both of them show foo to contain the same string of bytes?

How about doing this
- just to be sure that both strings contain the same sequence of bytes:

require 'enumerator'
foo.to_enum(:each_byte).to_a.join(", ")

In both cases:

=> "112, 114, 233, 102, 233, 114, 233, 101, 115"

Somehow, it is the regex that is being handled differently, not the
string.
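
The "same bytes, different match" situation can be illustrated on a current Ruby, with one caveat: since Ruby 1.9 the character classes follow the string's encoding rather than the process locale, so this is only an analogy to the irb/readline case above, not a reproduction of it. A sketch, with the sample string rebuilt from the byte values shown above and byte_list as a made-up helper name:

```ruby
# Same nine bytes as in the thread, first as untagged binary data ...
raw = "pr\xE9f\xE9r\xE9es".b
# ... then a copy tagged as ISO-8859-1 (an assumption; the thread names only
# the nl_NL locale, not an encoding).
latin1 = raw.dup.force_encoding(Encoding::ISO_8859_1)

# Hypothetical helper: render a string's bytes as a comma-separated list.
def byte_list(str)
  str.each_byte.to_a.join(', ')
end

puts byte_list(raw)
puts byte_list(latin1)

# Identical bytes, but the regex result may depend on the encoding tag.
p raw =~ /[^[:alnum:]]/
p latin1 =~ /[^[:alnum:]]/
```

If the two byte lists agree while the two match results differ, the difference lies in how the regex classifies the bytes, not in the string.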

Ian
--
Ian Macdonald | Reality does not exist -- yet.
(e-mail address removed) |
http://www.caliban.org/ |
 
