strange effect with [:lower:] in perl

T. Sander

I have a strange problem with the following perl code.
It produces the output :

A : dEf
B : dBf
D : DbF

Why is there no output for the case C?

Is this a bug, or what is the explanation for this behaviour?
When I change $c to "DgF" I get the output line for C.
I think the problem always occurs for lower when the lower-case character is the
successor of the upper-case character.
Why doesn't this happen with the equivalent upper variant?

I have tested this with different Perl versions (5.5 and 5.8) on Solaris and Windows.

--------------------
$a="dEf";
$b="dBf";

if (not ($a=~/[:upper:]/)) {
    print "A : $a\n";
}

if (not ($b=~/[:upper:]/)) {
    print "B : $b\n";
}

$c="DeF";
$d="DbF";

if (not ($c=~/[:lower:]/)) {
    print "C : $c\n";
}

if (not ($d=~/[:lower:]/)) {
    print "D : $d\n";
}

-------------------
 
T. Sander

Abigail said:
T. Sander ([email protected]) wrote on MMMDCCV September MCMXCIII in
<URL:news:[email protected]>:
** I have a strange problem with the following perl code.
** It produces the output :
**
** A : dEf
** B : dBf
** D : DbF
**
** Why is there no output for the case C?
**
** This must be a bug or what is the explanation for this behaviuor?
** When I change $c to "DgF" I get the output line for C.
** I think the problem always occur for lower when the lower character is the
** successor of the upper-case character.
** Why this doesn't happen with the same upper variant?
**
** I have tested this with different perl version 5.5, 5.8 on Solaris and Windows.
**
** --------------------
** $a="dEf";
** $b="dBf";
**
** if (not ($a=~/[:upper:]/)) {
** print "A : $a\n";}
**
**
** if (not ($b=~/[:upper:]/)) {
** print "B : $b\n";}
**
**
** $c="DeF";
** $d="DbF";
**
** if (not ($c=~/[:lower:]/)) {
** print "C : $c\n";}
**
** if (not ($d=~/[:lower:]/)) {
** print "D : $d\n";}
**

/[:lower:]/ matches if the string contains a ':', an 'l', an 'o', a 'w',
an 'e' or an 'r'. "DeF" contains an 'e', so it does match.

What you probably want is /[[:lower:]]/.
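
To see the difference concretely, here is a minimal sketch. The sub names are made up for illustration and are not from the original post. The first pattern is an ordinary bracketed class containing the characters ':', 'l', 'o', 'w', 'e' and 'r'; the second puts the POSIX class inside a bracket expression. Perl may warn about the first form when warnings are enabled, so the sketch switches that warning off locally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains ':', 'l', 'o', 'w', 'e' or 'r'.
sub has_class_char {
    my ($s) = @_;
    no warnings 'regexp';    # perl may otherwise warn about this pattern
    return $s =~ /[:lower:]/ ? 1 : 0;
}

# Hypothetical helper: true if the string contains any lower-case letter.
sub has_lower {
    my ($s) = @_;
    return $s =~ /[[:lower:]]/ ? 1 : 0;
}

# Compare both patterns on the strings from the question, plus an all-caps one.
for my $s ("DeF", "DbF", "DEF") {
    printf "%-3s  /[:lower:]/ => %d  /[[:lower:]]/ => %d\n",
        $s, has_class_char($s), has_lower($s);
}
```

"DeF" matches both patterns (because of the 'e'), "DbF" matches only the second, and "DEF" matches neither.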


Abigail



Thank you for your reply. Now it works.
But why do I have to use two [?
What is the difference between [a-z] and [:lower:]? For [a-z] I get the range
of letters and not only 'a', '-' and 'z'. Why not the same for [:lower:]?
In which case can I use only [:lower:] without an additional [] pair?
 
Anno Siegel

T. Sander said:
Abigail said:
T. Sander ([email protected]) wrote on MMMDCCV September MCMXCIII in
<URL:news:[email protected]>:
[...]
/[:lower:]/ matches if the string contains a ':', an 'l', an 'o', a 'w',
an 'e' or an 'r'. "DeF" contains an 'e', so it does match.

What you probably want is /[[:lower:]]/.


Abigail



Thank you for your reply. Now it works.
But why have I to use two [?

"Why" isn't a sensible question to ask at this point. The only possible
answer is, "Because whoever implemented it made it that way". About the
reasons we can only speculate.

> Where is the difference between [a-z] and [:lower:]? For [a-z] I get the range
> of letters and not only 'a','-' and 'z'. Why not the same for [:lower:]?

That would have introduced ":" as a new metacharacter in character classes,
breaking old programs.

> In which case can I use only [:lower:] without an additional [] pair?

You can't. The [:<anything>:] construct is only valid inside character
classes.
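
Because the [:name:] construct lives inside a bracket expression, it can be mixed with other classes, ranges and negation. A small sketch, with hypothetical sub names chosen only for this example:

```perl
use strict;
use warnings;

# True if the whole string consists of lower-case letters and digits only.
# Two POSIX classes are combined inside one bracket expression.
sub only_lower_or_digit {
    my ($s) = @_;
    return $s =~ /\A[[:lower:][:digit:]]+\z/ ? 1 : 0;
}

# True if the string contains at least one character that is NOT a
# lower-case letter. The ^ negates the whole bracket expression.
sub has_non_lower {
    my ($s) = @_;
    return $s =~ /[^[:lower:]]/ ? 1 : 0;
}

for my $s ("abc123", "abC123", "abc") {
    printf "%-6s  only_lower_or_digit => %d  has_non_lower => %d\n",
        $s, only_lower_or_digit($s), has_non_lower($s);
}
```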

Anno
 
Alan J. Flavell

> Because now you *can* do the same as with ranges: you can combine them.
>
> [a-z0-9] # Lowercase letters *and* digits.

Surely that only refers to a subset of what Unicode considers to be
"letters"?
 
Ben Morrow

Alan J. Flavell ([email protected]) wrote on MMMDCCIX September
MCMXCIII in <URL:ppepc56.ph.gla.ac.uk>:
"" On Mon, 27 Oct 2003, Abigail wrote:
""
"" > Because now you *can* do the same as with ranges: you can combine them.
"" >
"" > [a-z0-9] # Lowercase letters *and* digits.
""
"" Surely that only refers to a subset of what Unicode considers to be
"" "letters"?


Yeah, but that's what [:lower:] seems to do too:

$ perl -wle 'for (0x00 .. 0x80) {
printf "%02x %s\n", $_, chr if chr () =~ /[[:lower:]]/}'
No lowercase accented letters here.

Now extend that up to 0x120 or so, with perl5.8.

Ben
 
Alan J. Flavell

"" > [a-z0-9] # Lowercase letters *and* digits.
""
"" Surely that only refers to a subset of what Unicode considers to be
"" "letters"?

> Yeah, but that's what [:lower:] seems to do too:
>
> $ perl -wle 'for (0x00 .. 0x80) {

Surely you meant to set the limit at 0xff or so for this
demonstration?

> printf "%02x %s\n", $_, chr if chr () =~ /[[:lower:]]/}'
> [snip]
>
> No lowercase accented letters here.

Curious. No surprise when the limit's set at 0x80, as I'm sure you'd
agree; but I must admit I was surprised at the accented lower-case
letters up to 0xff not being counted, despite the accented lower case
letters above 0x100 being counted. Prima facie I think there's
something wrong here, no? (This is perl 5.8.0 per RedHat 9).

If I set the upper limit at, say, 0xfff, then I get lots of lower-case
letters reported in the blocks of extended Latin, Greek, Coptic,
Cyrillic and Armenian.

And there are more still, e.g. 0x2149 "DOUBLE-STRUCK ITALIC SMALL J"
;-)
 
Ben Morrow

Alan J. Flavell said:
"" > [a-z0-9] # Lowercase letters *and* digits.
""
"" Surely that only refers to a subset of what Unicode considers to be
"" "letters"?

Yeah, but that's what [:lower:] seems to do too:

$ perl -wle 'for (0x00 .. 0x80) {

Surely you meant to set the limit at 0xff or so for this
demonstration?
printf "%02x %s\n", $_, chr if chr () =~ /[[:lower:]]/}'
[snip]

No lowercase accented letters here.

Curious. No surprise when the limit's set at 0x80, as I'm sure you'd
agree; but I must admit I was surprised at the accented lower-case
letters up to 0xff not being counted, despite the accented lower case
letters above 0x100 being counted. Prima facie I think there's
something wrong here, no? (This is perl 5.8.0 per RedHat 9).

If I set the upper limit at, say, 0xfff, then I get lots of lower-case
letters reported in the blocks of extended Latin, Greek, Coptic,
Cyrillic and Armenian.

This confused me as well, at first; try

% perl -wle 'binmode STDOUT, ":utf8"; for (0x00 .. 0xFF) {
printf "%02x %s\n", $_, chr if substr(chr() . "\x{100}", 0, 1) =~ /[[:lower:]]/}'

, the sole point of the \x{100} being to upgrade the string to
Unicode...

Some of the characters produced still confuse me rather, though, such
as U+00AA FEMININE ORDINAL INDICATOR, but I guess Perl's just
returning what's in the Unicode standard.
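
The same effect can be shown without the concatenation trick. This is only a sketch: is_lower_cp is a made-up helper name, and it uses utf8::upgrade to force Perl's internal UTF-8 representation instead of appending "\x{100}". It assumes no locale, no /u modifier and no unicode_strings feature is in effect, since any of those changes how non-upgraded strings are matched.

```perl
use strict;
use warnings;

# Hypothetical helper: does the character with code point $cp match
# [[:lower:]]? If $upgrade is true, the one-character string is first
# converted to Perl's internal UTF-8 form, as the "\x{100}" trick does.
sub is_lower_cp {
    my ($cp, $upgrade) = @_;
    my $ch = chr $cp;
    utf8::upgrade($ch) if $upgrade;
    return $ch =~ /[[:lower:]]/ ? 1 : 0;
}

# 0x61 = 'a', 0xE9 = e-acute, 0xC9 = E-acute (ISO-8859-1 code points).
for my $cp (0x61, 0xE9, 0xC9) {
    printf "%02x  plain => %d  upgraded => %d\n",
        $cp, is_lower_cp($cp, 0), is_lower_cp($cp, 1);
}
```

Under those assumptions, 0xE9 counts as lower case only once the string has been upgraded, while 'a' matches either way.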

Ben
 
Alan J. Flavell

This confused me as well, at first; try

% perl -wle 'binmode STDOUT, ":utf8"; for (0x00 .. 0xFF) {
printf "%02x %s\n", $_, chr if substr(chr() . "\x{100}", 0, 1) =~ /[[:lower:]]/}'

, the sole point of the \x{100} being to upgrade the string to
Unicode...

Oh yes: the /[[:lower:]]/ regex fails to work when it's fed
non-upgraded iso-8859-1 characters, but works fine after forcing
the "upgrade".

That would surely have to be categorised as a bug?

> Some of the characters produced still confuse me rather, though, such
> as U+00AA FEMININE ORDINAL INDICATOR,

Yes, the feminine and masculine ordinals are formed from lower case
"a" and "o", so it's plausible at least.

> but I guess Perl's just returning what's in the Unicode standard.

Right, that would be the "Ll" indicator in the third field of the
Unicode character data, e.g. for version 3 of Unicode see
http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
(beware: large file!). See the explanation at
http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html ,
field number 2 "General Category". (I'm not sure which version of
Unicode they implement in 5.8.0, sorry)

By the way, just to show myself that it's feasible, I'm sitting at a
Windows/NT station, running the "putty" ssh client to connect to
redhat 9, and with "putty" configured to use a monospaced
Unicode-capable font, and utf-8 encoding. Seems to work fine for
display ;-)
 
Ben Morrow

Alan J. Flavell said:
This confused me as well, at first; try

% perl -wle 'binmode STDOUT, ":utf8"; for (0x00 .. 0xFF) {
printf "%02x %s\n", $_, chr if substr(chr() . "\x{100}", 0, 1) =~ /[[:lower:]]/}'

, the sole point of the \x{100} being to upgrade the string to
Unicode...

Oh yes: the /[[:lower:]]/ regex fails to work when it's fed
non-upgraded iso-8859-1 characters, but works fine after forcing
the "upgrade".

That would surely have to be categorised as a bug?

No: well, at any rate, it's intentional and documented. The aim is
that non-Unicode-aware programs being fed non-Unicode data carry on
working as before 5.6, and before 5.6 [[:lower:]] meant the same as
[a-z] unless you used locale.

Ben
 
Alan J. Flavell

Oh yes: the /[[:lower:]]/ regex fails to work when it's fed
non-upgraded iso-8859-1 characters, but works fine after forcing
the "upgrade".

That would surely have to be categorised as a bug?

No: well, at any rate, it's intentional and documented.

I appreciate the correction. (If you check the archives of this group
for quite some months now you'll maybe get the impression that I was
one of the few attempting to answer unicode-related questions - at
least that was starting to be /my/ impression - despite me being at
only a relatively early stage of getting to tangle with the stuff in
Perl. I sure appreciate seeing some informed input from others such
as yourself.)

> The aim is that non-Unicode-aware programs being fed non-Unicode
> data carry on working as before 5.6,

I suppose that's understandable-ish.

> and before 5.6 [[:lower:]]
> meant the same as [a-z] unless you used locale.

Well, must admit I was blissfully unaware of the existence of those
POSIX regex constructs in versions of Perl before 5.6, but I take your
point. As a 5.6 document succinctly puts it:

| This document varies from difficult to understand to completely and
| utterly opaque.

Ho hum...

thanks again
 
Ben Morrow

Alan J. Flavell said:
and before 5.6 [[:lower:]]
meant the same as [a-z] unless you used locale.

Well, must admit I was blissfully unaware of the existence of those
POSIX regex constructs in versions of Perl before 5.6, but I take your
point.

You are of course correct... /me gets a slap on the wrist for not
checking. :)

Let us say "5.6 when not under the 'utf8' pragma", then; the behaviour
is probably the most 'correct' for those assumptions. It is also
consistent with '\w' under 5.005, which didn't match accented
characters either, unless you used an appropriate locale.

Ben
 
Alan J. Flavell

> Let us say "5.6 when not under the 'utf8' pragma", then; the behaviour
> is probably the most 'correct' for those assumptions. It is also
> consistent with '\w' under 5.005, which didn't match accented
> characters either, unless you used an appropriate locale.

This is odd. If I execute this code which we discussed before:

for (0x00 .. 0xff) { printf "%02x %s\n" , $_, chr if chr () =~ /[[:lower:]]/}

but with "use locale" in effect, then on a RedHat 7.2 system, it
reports the accented lower-case letters also. LC_CTYPE is en_GB, and
Perl is RH perl-5.6.1-36.1.72

If I execute the *same* script in RedHat 9, then - even with "use
locale" in effect - it reverts to the old behaviour - nothing above
'z' is reported as a lower-case letter.

LC_CTYPE is "en_GB.UTF-8", and Perl is RH perl-5.8.0-88.3

However, if I set the locale to "en_GB" etc. then the extended
behaviour re-appears: accented lower-case letters are also reported.
Same for "en_GB.ISO8859-1" etc.

If I go back to the RH7.2/Perl5.6.1 system and explicitly setlocale()
to "en_GB.UTF-8", then the 5.6.1 system reports only a-z as lower
case, just as the 5.8.0 one did.


Could I summarise that by saying (applies to both versions):

* if the locale does not include utf-8, then "use locale" switches on
the reporting of lower-case accented letters.

This is what you already explained as being a compatibility feature
in the absence of "use locale", right?

* but if the locale _does_ imply utf-8, then it seems something
different happens. In this test, "use locale" doesn't report accented
lower-case letters, in either Perl version.

As we saw in the earlier discussion: if the string has been forcibly
upgraded to Perl's unicode format, then those accented letters were
reported, irrespective of "use locale", which is fine by me.

But it seems that if the string has not been upgraded to unicode
format, then even with "use locale" in effect, the accented letters
are not reported - this bit seems, at least, unintuitive (even a
mistake?).

Are my observations correct? Any insights?
 
Ben Morrow

Alan J. Flavell said:
On Wed, 29 Oct 2003, Ben Morrow wrote:

This is odd. If I execute this code which we discussed before:
Could I summarise that by saying (applies to both versions):

* if the locale does not include utf-8, then "use locale" switches on
the reporting of lower-case accented letters.

This is what you already explained as being a compatibility feature
in the absence of "use locale", right?

Yup.

* but if the locale _does_ imply utf-8, then it seems something
different happens. In this test, "use locale" doesn't report accented
lower-case letters, in either Perl version.

As we saw in the earlier discussion: if the string has been forcibly
upgraded to Perl's unicode format, then those accented letters were
reported, irrespective of "use locale", which is fine by me.

But it seems that if the string has not been upgraded to unicode
format, then even with "use locale" in effect, the accented letters
are not reported - this bit seems, at least, unintuitive (even a
mistake?).

Are my observations correct? Any insights?

Well, what you say certainly holds on my machine as well... I think
the answer to this is in perlunicode:

| BUGS
| Interaction with Locales
|
| Use of locales with Unicode data may lead to odd results.
| [...] Use of locales with Unicode is discouraged.

and yes, it probably is a bug. Certainly, a UTF8 locale is treated
qualitatively differently from any other.

What seems to be happening is that in 5.6 'use locale' with a UTF8
locale is treated identically to 'use utf8', and in 5.8 it is ignored
(at least as far as character sets/encodings are concerned); perl then
treats all non-upgraded data as though locale support wasn't present,
and assumes it's encoded in iso8859-1 when it needs to be upgraded.

This is arguably incorrect :), but I guess it's a reasonable
compromise. It would be nice to have an 'all data has the utf8 flag on,
all the time, except under 'use bytes'' pragma; or is this what the
new -C flag (or having a UTF8 locale in 5.8.0) does, in effect?

The Right Answer, I guess, is this:

Under 'no locale':

* Upgraded data is in utf8. [[:lower:]] et al match exactly the same
as \p{Ll}: i.e., by the definitions given in the Unicode database.

* All non-upgraded data is considered to be ASCII[2]. Strings
containing top-bit-set bytes are binary, and cannot be
upgraded... or maybe all the top-bit-set chars are upgraded to
their corresponding Unicode codepoints, with or without a
warning.

I don't like the current 'let's just randomly assume iso8859-1'
approach. I would like to say that top-bit-set chars should all be
upgraded to U+FFFD, but I feel this might cause problems... :)

* Since non-upgraded data is ASCII, [[:lower:]] == [a-z] [3]. Matching
against \p{Ll} causes the data to be upgraded (if you're using
Unicode-y operators, you can't object to Perl upgrading), and
matched against the Unicode database.

Under 'use locale':

* Upgraded data is utf8. Non-upgraded data (when treated as text) is
considered to be encoded as the charset[1] portion of the locale,
and is upgraded to utf8 on that basis when necessary.

* [[:lower:]] != \p{Ll}. [[:lower:]] matches (character set implied
by locale) intersect (\p{Ll}), on both non- and upgraded data.

* Opened filehandles have an appropriate :encoding() layer
automatically pushed.

Under 'use bytes' (which overrides 'use locale'):

* All data is considered to be binary, and the use of any text-y
regex components such as [[:lower:]] or \p is an error. [a-z] is
interpreted as [\x61-\x7a] (or the equivalent EBCDIC).

* Opened filehandles have :raw automatically pushed.
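
As a side note on the third 'no locale' bullet above: at least on recent Perls, a pattern containing \p{...} already matches a non-upgraded string by Unicode rules, while [[:lower:]] on the same string falls back to ASCII rules. A minimal sketch of that contrast follows. The sub names are hypothetical, and it assumes no locale, no /u modifier and no unicode_strings feature.

```perl
use strict;
use warnings;

my $e_acute = chr 0xE9;    # e-acute, not upgraded to internal UTF-8

# Hypothetical helper: POSIX class, subject to the upgraded/non-upgraded split.
sub posix_lower {
    my ($s) = @_;
    return $s =~ /[[:lower:]]/ ? 1 : 0;
}

# Hypothetical helper: Unicode property, matched by Unicode rules.
sub unicode_lower {
    my ($s) = @_;
    return $s =~ /\p{Ll}/ ? 1 : 0;
}

printf "[[:lower:]] => %d  \\p{Ll} => %d\n",
    posix_lower($e_acute), unicode_lower($e_acute);
```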

locale should have two functions, locale::to_local and
locale::from_local which work identically to Encode::(en|de)code with
the appropriate encoding supplied.
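
A rough sketch of what such a pair might look like, built on the real Encode::encode and Encode::decode. The to_local and from_local subs here are hypothetical, since no such functions exist in the locale pragma, and the charset is passed in explicitly rather than derived from the current locale, which a real version would have to do.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for the proposed locale::from_local:
# bytes in the given charset -> Perl character string.
sub from_local {
    my ($charset, $bytes) = @_;
    return decode($charset, $bytes);
}

# Hypothetical stand-in for the proposed locale::to_local:
# Perl character string -> bytes in the given charset.
sub to_local {
    my ($charset, $text) = @_;
    return encode($charset, $text);
}

# e-acute arrives as two bytes in UTF-8 and leaves as one byte in ISO-8859-1.
my $text  = from_local("UTF-8", "\xC3\xA9");
my $bytes = to_local("iso-8859-1", $text);
printf "code point U+%04X, re-encoded as %d byte(s)\n", ord($text), length($bytes);
```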

Hmm, wonder what p5p's opinion on all that would be? "Go away, it's
working now, the right time to have said this was some time ago" would
certainly be fair enough... :)

Ben

[1] ...in the MIME sense, i.e. an encoding. I am aware of the
difference, it's just tiresome to be Correct all the time :).

[2] or EBCDIC, as appropriate, throughout.

[3] or rather, [abcd...xyz], to account for EBCDIC.
 
