Alan J. Flavell said:
On Wed, 29 Oct 2003, Ben Morrow wrote:
This is odd. If I execute this code which we discussed before:
Could I summarise that by saying (applies to both versions):
* if the locale does not include utf-8, then "use locale" switches on
the reporting of lower-case accented letters.
This is what you already explained as being a compatibility feature
in the absence of "use locale", right?
Yup.
* but if the locale _does_ imply utf-8, then it seems something
different happens. In this test, "use locale" doesn't report accented
lower-case letters, in either Perl version.
As we saw in the earlier discussion: if the string has been forcibly
upgraded to Perl's unicode format, then those accented letters were
reported, irrespective of "use locale", which is fine by me.
But it seems that if the string has not been upgraded to unicode
format, then even with "use locale" in effect, the accented letters
are not reported - this bit seems, at least, unintuitive (even a
mistake?).
Are my observations correct? Any insights?
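A minimal sketch of the upgraded versus non-upgraded difference described above, with no locale in play at all. It assumes a perl where the 'unicode_strings' feature is not enabled, and the variable names are only illustrative:

```perl
use strict;
use warnings;

# e-acute as a single Latin-1 byte; the string is NOT upgraded.
my $s = "\xe9";
my $before = ($s =~ /[[:lower:]]/) ? 1 : 0;

# Force Perl's internal Unicode (utf8) representation. The character
# itself is unchanged; only the storage format differs.
utf8::upgrade($s);
my $after = ($s =~ /[[:lower:]]/) ? 1 : 0;

print "before upgrade: $before, after upgrade: $after\n";
```

Under those assumptions I would expect the first match to fail and the second to succeed, even though both strings hold the same character.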
Well, what you say certainly holds on my machine as well... I think
the answer to this is in perlunicode:
| BUGS
| Interaction with Locales
|
| Use of locales with Unicode data may lead to odd results.
| [...] Use of locales with Unicode is discouraged.
and yes, it probably is a bug. Certainly, a UTF8 locale is treated
qualitatively differently from any other.
What seems to be happening is that in 5.6 'use locale' with a UTF8
locale is treated identically to 'use utf8', and in 5.8 it is ignored
(at least as far as character sets/encodings are concerned); perl then
treats all non-upgraded data as though locale support wasn't present,
and assumes it's encoded in iso8859-1 when it needs to be upgraded.
This is arguably incorrect, but I guess it's a reasonable
compromise. It would be nice to have an "all data has the utf8 flag on,
all the time, except under 'use bytes'" pragma; or is this what the
new -C flag (or having a UTF8 locale in 5.8.0) does, in effect?
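To make the "assumes it's encoded in iso8859-1" step concrete, here is a small sketch (my own illustration, not taken from the earlier test script) that upgrades a single top-bit-set byte and shows which code point Perl decides it was:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s = "\xe9";        # one top-bit-set byte, not upgraded
utf8::upgrade($s);     # Perl interprets the byte as Latin-1

# The code point Perl chose, and its UTF-8 serialisation as hex bytes.
my $cp    = sprintf "U+%04X", ord($s);
my $bytes = encode("UTF-8", $s);
my $hex   = join " ", map { sprintf "%02X", ord } split //, $bytes;

print "$cp -> $hex\n";
```

Byte 0xE9 comes out as U+00E9 (UTF-8 bytes C3 A9) whatever the locale's charset says, which is exactly the behaviour being questioned here.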
The Right Answer, I guess, is this:
Under 'no locale':
* Upgraded data is in utf8. [[:lower:]] et al match exactly the same
as \p{Ll}: i.e., by the definitions given in the Unicode database.
* All non-upgraded data is considered to be ASCII[2]. Strings
containing top-bit-set bytes are binary, and cannot be
upgraded... or maybe all the top-bit-set chars are upgraded to
their corresponding Unicode codepoints, with or without a
warning.
I don't like the current 'let's just randomly assume iso8859-1'
approach. I would like to say that top-bit-set chars should all be
upgraded to U+FFFD, but I feel this might cause problems...
* Since non-upgraded data is ASCII, [[:lower:]] == [a-z] [3]. Matching
against \p{Ll} causes the data to be upgraded (if you're using
Unicode-y operators, you can't object to Perl upgrading), and
matched against the Unicode database.
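A short sketch of the last point in that list, as far as I can tell how a reasonably recent perl already behaves under 'no locale'. It assumes 'unicode_strings' is off; nothing here is specific to the proposal:

```perl
use strict;
use warnings;

my $s = "\xe9";    # Latin-1 e-acute, not upgraded, no 'use locale'

# The POSIX class sees only ASCII lower-case letters in non-upgraded data.
my $posix = ($s =~ /[[:lower:]]/) ? 1 : 0;

# A Unicode property makes the match use Unicode rules.
my $prop = ($s =~ /\p{Ll}/) ? 1 : 0;

print "[[:lower:]]: $posix, \\p{Ll}: $prop\n";
```

I would expect 0 for the POSIX class and 1 for the property, i.e. using a Unicode-y operator is what opts the data into Unicode semantics.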
Under 'use locale':
* Upgraded data is utf8. Non-upgraded data (when treated as text) is
considered to be encoded as the charset[1] portion of the locale,
and is upgraded to utf8 on that basis when necessary.
* [[:lower:]] != \p{Ll}. [[:lower:]] matches (character set implied
by locale) intersect (\p{Ll}), on both non-upgraded and upgraded data.
* Opened filehandles have an appropriate :encoding() layer
automatically pushed.
Under 'use bytes' (which overrides 'use locale'):
* All data is considered to be binary, and the use of any text-y
regex components such as [[:lower:]] or \p is an error. [a-z] is
interpreted as [\x61-\x7a] (or the equivalent EBCDIC).
* Opened filehandles have :raw automatically pushed.
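The two filehandle bullets above can be imitated by hand today by naming the layer explicitly in open(). A sketch using in-memory filehandles, with UTF-8 standing in for "the locale's encoding" (that choice is my assumption, purely for illustration):

```perl
use strict;
use warnings;

my $raw = "\xc3\xa9";    # the two UTF-8 bytes of e-acute

# Text handle: an explicit :encoding() layer decodes on read.
open(my $text_fh, '<:encoding(UTF-8)', \$raw) or die "open: $!";
my $text = <$text_fh>;
close $text_fh;

# Binary handle: :raw hands the bytes through untouched.
open(my $bin_fh, '<:raw', \$raw) or die "open: $!";
my $bin = <$bin_fh>;
close $bin_fh;

printf "text: %d char(s), binary: %d byte(s)\n", length($text), length($bin);
```

The closest existing thing to the "automatically pushed" part that I know of is the open pragma ('use open ":locale"'), though it is still something you have to ask for.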
locale should have two functions, locale::to_local and
locale::from_local, which work identically to Encode::(en|de)code with
the appropriate encoding supplied.
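A rough sketch of what such a pair might look like if written by hand today. The names to_local and from_local are the hypothetical ones proposed above (here as plain subs, not part of the real locale pragma), and the fallback to iso-8859-1 is my own assumption for when Encode does not recognise the locale's codeset name:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Adopt the user's locale, then ask it for its character encoding.
setlocale(LC_CTYPE, "");
my $codeset = langinfo(CODESET);

# ASSUMPTION: fall back to iso-8859-1 if Encode does not know the name.
my $enc = find_encoding($codeset) || find_encoding("iso-8859-1");

# Hypothetical helpers, named after the functions proposed above.
sub from_local { my ($bytes) = @_; return $enc->decode($bytes) }
sub to_local   { my ($text)  = @_; return $enc->encode($text) }

my $round_trip = to_local(from_local("abc"));
print "codeset: $codeset, round trip: $round_trip\n";
```

This only covers the explicit conversion; the implicit "upgrade on that basis when necessary" part of the proposal would need support inside perl itself.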
Hmm, wonder what p5p's opinion on all that would be? "Go away, it's
working now, the right time to have said this was some time ago" would
certainly be fair enough...
Ben
[1] ...in the MIME sense, i.e. an encoding. I am aware of the
difference, it's just tiresome to be Correct all the time.
[2] or EBCDIC, as appropriate, throughout.
[3] or rather, [abcd...xyz], to account for EBCDIC.