(Last night I reread loads of perlunicode and friends; I feel much
better now.) No, they are the same length *if* the encoding of the stream is set:
You posted the output of Devel::Peek::Dump, so I thought you were
talking about the *internal* representation.
How many bytes they occupy in an I/O stream depends on the encoding.
LATIN SMALL LETTER A WITH GRAVE is one byte in ISO-8859-1, CP850, ...
LATIN SMALL LETTER A WITH GRAVE is two bytes in UTF-8, UTF-16, ...
LATIN SMALL LETTER A WITH GRAVE is four bytes in UTF-32, ...
CYRILLIC SMALL LETTER A is one byte in ISO-8859-5, KOI-8, ...
CYRILLIC SMALL LETTER A is two bytes in UTF-8, UTF-16, ...
CYRILLIC SMALL LETTER A is four bytes in UTF-32, ...
(And of course, both characters cannot be represented at all in some
encodings: There is no LATIN SMALL LETTER A WITH GRAVE in ISO-8859-5,
and no CYRILLIC SMALL LETTER A in ISO-8859-1)
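If you want to check such numbers yourself, here is a minimal sketch
using the core Encode module (the encoding names are just spellings
Encode accepts; FB_CROAK turns an unmappable character into an error
instead of a silent '?' substitution):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode FB_CROAK);

    my %chars = (
        'LATIN SMALL LETTER A WITH GRAVE' => "\x{E0}",
        'CYRILLIC SMALL LETTER A'         => "\x{430}",
    );

    for my $name (sort keys %chars) {
        for my $enc (qw(ISO-8859-1 ISO-8859-5 UTF-8 UTF-16BE UTF-32BE)) {
            # encode() may modify its input in place when a CHECK
            # argument is given, so work on a copy
            my $copy  = $chars{$name};
            my $bytes = eval { encode($enc, $copy, FB_CROAK) };
            printf "%-32s %-10s %s\n", $name, $enc,
                defined $bytes ? length($bytes) . " byte(s)"
                               : "not representable";
        }
    }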
{7453:22} [0:0]% perl -CS -Mutf8 -wle 'print "à"' | xxd
0000000: c3a0 0a ...
{7459:23} [0:0]% perl -CS -Mutf8 -wle 'print "а"' | xxd
0000000: d0b0 0a ...
{7466:24} [0:0]%
But latin1 is special (I've reread perlunicode and friends): *if*
there's no reason (printing isn't a reason) to upgrade to utf8, then
*characters* of the latin1 script (and latin1 only) stay *bytes*:
I already explained that. When writing to a file handle, perl doesn't
care whether a string is composed of bytes or characters.
If the file handle has no :encoding() layer, it will try to write each
element of the string as a single byte.
If the file has an :encoding() layer, it will interpret each element of
the string as a character and convert that to a byte sequence according
to that encoding.
So without an encoding layer "\x{E0}" will always be written as the single byte
0xE0, regardless of whether the string is a byte string or a character
string. With an ":encoding(UTF-8)" layer it will always be written as
two bytes 0xC3 0xA0; and with an ":encoding(CP850)" layer, it will
always be written as a single byte 0x85.
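Here is a sketch of that, in case you want to watch it happen
(File::Temp is only used to have something to write to, and the
bytes_on_disk() helper is made up for this example):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Write $string through the given layer, then read the raw bytes back
    sub bytes_on_disk {
        my ($layer, $string) = @_;
        my ($fh, $file) = tempfile(UNLINK => 1);
        binmode $fh, $layer;
        print {$fh} $string;
        close $fh or die "close: $!";

        open my $raw, '<:raw', $file or die "open: $!";
        local $/;
        return unpack 'H*', <$raw>;
    }

    my $str = "\x{E0}";    # LATIN SMALL LETTER A WITH GRAVE
    print bytes_on_disk(':raw',             $str), "\n";   # e0
    print bytes_on_disk(':encoding(UTF-8)', $str), "\n";   # c3a0
    print bytes_on_disk(':encoding(cp850)', $str), "\n";   # 85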
What is apparently confusing you is what happens if that fails.
Obviously you can't write a single byte with the value 0x430, you can't
encode CYRILLIC SMALL LETTER A in ISO-8859-1 and you can't encode LATIN
SMALL LETTER A WITH GRAVE in ISO-8859-5.
So what does perl do? It prints a warning to STDERR and writes
a more or less reasonable approximation to the stream. The details
depend on the I/O layer:
If there is no :encoding() layer, the warning is "Wide character in
print" and the utf-8 representation is sent to the stream. And to
confuse matters further, this is done for the whole string, not just
this particular string element:
% perl -Mutf8 -E 'say "->\x{E0}\x{430}<-"'
Wide character in say at -e line 1.
->àа<-
(one string: \x{E0} and \x{430} converted to UTF-8)
% perl -Mutf8 -E 'say "->\x{E0}<-", "->\x{430}<-"'
Wide character in say at -e line 1.
->�<-->а<-
(two strings: \x{E0} printed as a single byte, \x{430} converted to UTF-8)
If there is an :encoding() layer, the warning is "\x{....} does not map
to $charset" and a \x{....} escape sequence is sent to the stream:
% perl -Mutf8 -E 'binmode STDOUT, ":encoding(iso-8859-5)"; say "->\x{E0}<-"'
"\x{00e0}" does not map to iso-8859-5 at -e line 1.
->\x{00e0}<-
But these are responses to an *error* condition. You shouldn't try to
write codepoints > 255 to a byte stream (actually, you shouldn't write
any characters to a byte stream, a byte stream is for bytes), and you
shouldn't try to write latin accented characters to a cyrillic stream.
Or at least you shouldn't be terribly surprised if the result is a
little confusing - garbage in, garbage out.
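For comparison, the unsurprising variant as a small sketch: declare
the encoding of the stream first, and printing characters - latin,
cyrillic, or both in one string - becomes unremarkable:

    #!/usr/bin/perl
    use strict;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';   # STDOUT now takes characters

    my $mixed = "->\x{E0}\x{430}<-";      # latin and cyrillic in one string
    print $mixed, "\n";                   # no "Wide character" warning, clean UTF-8 output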
But even if the encoding of the stream isn't set, concatenation with a
non-latin1 script upgrades latin1 too:
The term "upgrade" has a rather specific meaning in Perl in context with
byte and character strings, and I don't think you are talking about
that.
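For reference, "upgrade" in that narrow sense means changing the
internal storage of a string without changing its value. A small
sketch (Devel::Peek's Dump output goes to STDERR):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Devel::Peek;

    my $s = "\x{E0}";    # one character, U+00E0, stored as a single byte
    Dump $s;             # no UTF8 flag, PV is "\340"

    utf8::upgrade($s);   # change only the internal storage to utf8
    Dump $s;             # UTF8 flag set, PV is now "\303\240"

    # The value is untouched: still one character, still U+00E0
    printf "length=%d ord=U+%04X\n", length($s), ord($s);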
{7800:26} [0:0]% perl -Mutf8 -wle 'print "[à][а]"' | xxd
Wide character in print at -e line 1.
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].
You have a single string "[à][а]" here. As I wrote above, print treats
the string as a unit and in the absence of an :encoding() layer just
dumps it in UTF-8 encoding. So, yes, both the "à" and the "а" within this
single string will be UTF-8-encoded (as will be the square brackets, but
for them the UTF-8 encoding is the same as for US-ASCII, so you don't
notice that).
And I repeat it again: You are doing something which just doesn't make
sense (writing characters to a byte stream), so don't be surprised if
the result is a little surprising. Do it right and the result will make
sense.
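Doing it right for that particular one-liner just means declaring the
encoding of STDOUT, for example with -CS (or an explicit :encoding()
layer). The bytes that come out are the same, but now they are the
result of a deliberate encoding step and there is no warning. It should
look something like:

% perl -CS -Mutf8 -wle 'print "[à][а]"' | xxd
0000000: 5bc3 a05d 5bd0 b05d 0a [..][..].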
Please rewind the thread. That's exactly what happened a couple of posts
ago (specifically: <[email protected]> and <[email protected]>).
I've read these postings but I don't know what you are referring to. If
you are referring to other postings (especially long ones), please cite
the relevant part.
{9829:45} [0:0]% perl -Mutf8 -MDevel::Peek -wle '$aa = "aàа" ; Dump $aa'
SV = PV(0xa06f750) at 0xa08afac
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0xa086a08 "a\303\240\320\260"\0 [UTF8 "a\x{e0}\x{430}"]
CUR = 5
LEN = 12
*Characters* of latin1 aren't wide (even if they are characters, they
are still one byte long)
In UTF-8, latin-1 characters >= 0x80 are 2 bytes, the same as cyrillic
characters. Your example shows this: "à" (LATIN SMALL LETTER A WITH
GRAVE) is "\303\240" and "а" (CYRILLIC SMALL LETTER A) is "\320\260".
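A one-liner version of the same observation, counting the bytes of the
UTF-8 encoding instead of peeking at the internals, would be something
like:

% perl -MEncode -E 'say length encode("UTF-8", "\x{E0}")'
2
% perl -MEncode -E 'say length encode("UTF-8", "\x{430}")'
2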
No. Because it's not UTF-8, it's utf8.
I presume that by "utf8" you mean a string with the UTF8 bit set
(testable with the utf8::is_utf8() function). But as I've written
repeatedly, this is completely irrelevant for I/O. A string will be
treated completely identically, whether it has this bit set or not. It is
only the value of the string which is important, not its internal type
and representation.
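To make that concrete, a small sketch: two strings with the same value,
one with the UTF8 flag and one without, compare equal and produce
byte-for-byte identical output through whatever layer the handle has:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $as_bytes = "\x{E0}";
    my $as_chars = "\x{E0}";
    utf8::upgrade($as_chars);    # flip only the internal representation

    print utf8::is_utf8($as_bytes) ? "UTF8 flag\n" : "no UTF8 flag\n";  # no UTF8 flag
    print utf8::is_utf8($as_chars) ? "UTF8 flag\n" : "no UTF8 flag\n";  # UTF8 flag

    print $as_bytes eq $as_chars ? "same value\n" : "different value\n";  # same value

    binmode STDOUT, ':encoding(UTF-8)';
    print $as_bytes, "\n";       # c3 a0 0a on the wire
    print $as_chars, "\n";       # c3 a0 0a on the wire, identical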
(Also, I find it very confusing that you post the output of
Devel::Peek::Dump, but then apparently don't refer to it but talk about
something else. Please try to organize your postings in a way that one
can understand what you are talking about. It is very likely that this
exercise will also clear up the confusion in your mind)
As long as utf8 semantics isn't set, any scalar stays plain
bytes:
{2786:10} [0:0]% perl -MDevel::Peek -wle 'Dump "à"'
SV = PV(0x9d0e878) at 0x9d29f28
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x9d2ddc8 "\303\240"\0
CUR = 2
LEN = 12
However, when utf8 semantics is set, those codepoints that fit the
latin1 script become special Perl-latin1:
{5930:11} [0:0]% perl -MDevel::Peek -Mutf8 -wle 'Dump "à"'
SV = PV(0x9b92880) at 0x9badf10
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x9bb1eb0 "\303\240"\0 [UTF8 "\x{e0}"]
CUR = 2
LEN = 12
Yes. We've been through that. Ben explained it in excruciating detail.
What don't you understand here?
Mine are us-ascii; I have open.pm for the rest.
US-ASCII is a subset of UTF-8, so your files are UTF-8, too ;-). (Most
of mine don't contain non-ASCII characters either) What I meant is that
I don't use any other encoding (like ISO-8859-1 or ISO-8859-15) to
encode non-ASCII characters, so I don't have any need for "use
encoding". If your scripts are all in ASCII and you use open.pm for
"rest", what do you need "use encoding" for? Remember, this subthread
started when you berated Ben for discouraging the use of "use encoding".
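For completeness, the open.pm route mentioned above looks roughly like
this (just a sketch; 'input.txt' is a made-up placeholder name):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;                              # literals in this source file are UTF-8
    use open ':std', ':encoding(UTF-8)';   # default I/O layer, including the std handles

    print "\x{E0}\x{430}\n";               # encoded once on output, no warning

    # 'input.txt' is only a placeholder name for this example
    open my $fh, '<', 'input.txt' or die "open: $!";
    while (my $line = <$fh>) {
        # $line already contains characters, not UTF-8 bytes
    }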
hp