how is the string encoded

dn.perl · Jan 3, 2012

I know the question must have been asked many times, there are many
web-pages which are supposed to help, but after going through many of
them, I still need help.
I am running a simple program on linux, perl 5.8.8 ;

use strict ;
use warnings ;
## use utf8 ;

my $str ;
$str = "ä" ;
print "str is $str\n" ;
---
Works well. But my question is: how do I know which encoding is being
used to read/write $str?

If I uncomment 'use utf8', I get a warning: Malformed UTF-8
character. And the string no longer prints correctly. Why, and how do I
remove this warning and print the string correctly? I would have
guessed that 'use utf8' adds more power to the code and would not break
code which was otherwise running correctly.

Rainer Weikusat · Jan 3, 2012

dn.perl said:
I know the question must have been asked many times, there are many
web-pages which are supposed to help, but after going through many of
them, I still need help.
I am running a simple program on linux, perl 5.8.8 ;

use strict ;
use warnings ;
## use utf8 ;

my $str ;
$str = "ä" ;
print "str is $str\n" ;

According to the people who dabble in this area, you are not supposed
to know that. You are supposed to convert any data flowing into perl
from the encoding known to you into 'the super-secret, proprietary
internal Perl encoding' (patent pending) and any data flowing out of
perl from said 'super-secret internal Perl encoding' into whatever
encoding you'd like to have. Should the encoding you want to use (for
whatever reason) not be among the ones Perl supports natively, you're
fucked and advised to take your petty problems elsewhere. That's the
theory. Practically, Perl uses utf8 (which presumably causes a lot of
people sour bumpers because Microsoft [reportedly] uses UCS-2).

Another practical piece of advice: Stick to ASCII. That's the only
thing no American committee is going to uninvent tomorrow and thus, a
safe choice for all communication needs among educated people. Let all
those club-bearing natives draw their weird krikel-krakels to their
heart's content and ignore them.

Helmut Richter · Jan 3, 2012

According to the people who dabble in this area, you are not supposed
to know that. You are supposed to convert any data flowing into perl
from the encoding known to you into 'the super-secret, proprietary
internal Perl encoding' (patent pending) and any data flowing out of
perl from said 'super-secret internal Perl encoding' into whatever
encoding you'd like to have.

You have to distinguish between the encoding used by *you* while you
are writing your perl program, and the encoding used by *perl* while it is
running your program.

You have to know what *you* are using. The answer has nothing to do with
perl. If you can look at your program in an environment where UTF-8 is
expected and you read it correctly there, then the program is in UTF-8.
Use the "use utf8" to tell perl about it. It has no effect on what the
program does with the strings in it.
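One way to check what an editor actually saved is to look at the raw bytes of the literal. The following is only a sketch: the helper names hex_bytes and is_wellformed_utf8 are made up for illustration, and the two candidate byte sequences are written with explicit escapes so that the example does not depend on how this text itself is stored. utf8::decode is a core function that returns false when its argument is not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: show the raw bytes of a string as hex.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $s);
}

# Hypothetical helper: 1 if the bytes form well-formed UTF-8, else 0.
sub is_wellformed_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # utf8::decode works in place, so use a copy
    return utf8::decode($copy) ? 1 : 0;
}

# The two byte sequences an editor might have saved for the letter a-umlaut.
my $latin1 = "\xE4";        # saved as ISO-8859-1
my $utf8   = "\xC3\xA4";    # saved as UTF-8

print "latin1: ", hex_bytes($latin1),
      "  well-formed UTF-8: ", is_wellformed_utf8($latin1), "\n";
print "utf8:   ", hex_bytes($utf8),
      "  well-formed UTF-8: ", is_wellformed_utf8($utf8), "\n";
```

A lone E4 byte is not well-formed UTF-8, which is consistent with the "Malformed UTF-8 character" warning in the original post if the file was saved as ISO-8859-1 and 'use utf8' was then switched on.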

For the encoding used by *perl* while it is running your program, Rainer
Weikusat's comment applies. You should not try to know. As long as all
characters in a string are in the ISO-8859-1 character set, it is probable
that ISO-8859-1 is internally used; there is an additional flag in the
internal representation to indicate how the string is internally stored.
Don't mess around with the internal encoding. Rather, *you* have to know
how you meant the string: either as a sequence of bytes whose character
meaning only you know, or as a sequence of characters whose encoding as
bytes only perl knows. Do not try to share such knowledge between perl and
you. This is fairly well explained in perlunitut (e.g.
http://search.cpan.org/~flora/perl-5.14.2/pod/perlunitut.pod).
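The decode-on-input, encode-on-output discipline described in perlunitut might look roughly like this. It is a minimal sketch, not a recommendation from any poster: the sample word and the variable names are invented, and the input is assumed to be ISO-8859-1. Only decode and encode from the Encode module are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket known to be ISO-8859-1.
# (Invented sample data: the German word "schoen" with o-umlaut.)
my $input_bytes = "sch\xF6n";

# Decode at the boundary: bytes in, characters out.
my $text = decode('ISO-8859-1', $input_bytes);

# Work on characters. length counts characters, uc knows about o-umlaut.
my $char_count = length $text;
my $shouted    = uc $text;

# Encode at the boundary: characters in, bytes in the wanted encoding out.
my $output_bytes = encode('UTF-8', $shouted);

printf "%d characters in, %d bytes out\n", $char_count, length $output_bytes;
```

At no point does the code need to know how perl stores $text internally; it only names the encodings used outside the program.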

Rainer Weikusat · Jan 3, 2012

Helmut Richter said:
You have to distinguish between the encoding used by *you* while you
are writing your perl program, and the encoding used by *perl* while it is
running your program.

No. The people who *presently* work on Perl unicode support *want*
users of the language to pretend that 'the internal perl
encoding' is some magic secret beyond the realm of Perl code, *despite*
this being obviously at odds with the original design of 'unicode support
for Perl'. And this doesn't make much sense: At the very least, this
requires one additional copy of all data flowing into Perl and one
additional copy of all data going out of Perl. Given that one of the
main uses of Perl is as a so-called 'glue language' interconnecting
other pieces of software into a complex whole, this is a major pain in
the ass, and this solely for the hypothetical benefit of the people
working on the code. It is hypothetical because there is no way in
heaven or hell that all of the existing Perl code which wasn't written
based on the assumption that Perl strings are magic beasts with
intransigent properties is ever going to be changed just because this
would appeal to someone's completely impractical idea of theoretical
purity. The worst possible case is that, someday, a Perl 5 fork
is created which does break all this code, and this will then simply
become Perl 6 rev 0.5: something which exists for the private joy
of its developers and which nobody uses for anything.

Rainer Weikusat · Jan 3, 2012

Ben Morrow said:
That perl is very nearly six years old. You should upgrade to at least
5.12.

If you don't 'use utf8', perl assumes your source is ISO8859-1. If
you do, it assumes your source is in UTF-8. (In theory you can use other
encodings with the 'use encoding' pragma, but AIUI this doesn't work
reliably.)

Output is completely unrelated. If you don't do anything special, perl
will give you output in ISO8859-1.

This isn't quite correct: It will use 'the native 8 bit encoding' and
this may well be something other than ASCII/ ISO-8859-1, although
that's a case which rarely occurs in practice because most people
don't write code for IBM mainframes :->.

[...]

If you attempt to print a character which can't be represented in
ISO8859-1 you get a warning and the raw UTF-8 bytes representing
that character: this is obviously something you need to avoid, since
the output doesn't make any sense at that point.

An example I recently encountered where it did make sense was a web
interface with a Japanese localization: Since there were no characters
corresponding with codepoints from (128, 255), the generated output
was simply UTF-8 encoded Japanese which was exactly what it was
supposed to be.
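The behaviour discussed here can be observed without touching the terminal by printing to in-memory filehandles. This is a sketch under stated assumptions: bytes_written and hex_bytes are helper names invented for the example, and it assumes no -C switch, PERL_UNICODE setting or 'open' pragma is changing the default layers.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with $mode,
# return the bytes that were written and the number of warnings raised.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    open my $fh, $mode, \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return ($buffer, scalar @warnings);
}

# Hypothetical helper: raw bytes as hex.
sub hex_bytes { join ' ', map { sprintf('%02X', ord($_)) } split(//, $_[0]) }

my $latin = "\x{E4}";            # every character fits in one byte
my $mixed = "\x{E4}\x{3042}";    # U+3042 (a hiragana letter) does not

# No layer: the first string goes out as one byte, the second triggers
# "Wide character in print" and goes out as UTF-8 for both characters.
my ($plain_latin)              = bytes_written('>', $latin);
my ($plain_mixed, $warn_count) = bytes_written('>', $mixed);

# Explicit layer: both strings are written as UTF-8, without warnings.
my ($layer_latin) = bytes_written('>:encoding(UTF-8)', $latin);
my ($layer_mixed) = bytes_written('>:encoding(UTF-8)', $mixed);

print "no layer, latin: ", hex_bytes($plain_latin), "\n";
print "no layer, mixed: ", hex_bytes($plain_mixed), " (warnings: $warn_count)\n";
print "layer,    latin: ", hex_bytes($layer_latin), "\n";
print "layer,    mixed: ", hex_bytes($layer_mixed), "\n";
```

Without a layer the same character U+00E4 is written as E4 in one case and as C3 A4 in the other, depending on what else is in the string. With an explicit :encoding(UTF-8) layer the output is consistent.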

dn.perl · Jan 4, 2012

That perl (5.8.8) is very nearly six years old. You should
upgrade to at least 5.12.

I wonder whether you realize how difficult (ranging to impossible) it
may be to achieve it. Say, I am on a 3-month contract. The employer
has been managing for years with 5.8.8 and is unlikely to upgrade in
such a case. Once I was stuck with a MySQL server which was many years
old, but my boss was more concerned with preserving his own job than
asking his BOSS to spend time and money on upgrading. Not that the
suggestion to upgrade is wrong or anything.

If you don't 'use utf8', perl assumes your source is ISO8859-1. If
you do, it assumes your source is in UTF-8. (In theory you can use other
encodings with the 'use encoding' pragma, but AIUI this doesn't work
reliably.)
...
What did you expect to happen? perldoc utf8 quite clearly says
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.
so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
must expect warnings and misbehaviour.

It is very useful to know that perl assumes the source to be
ISO8859-1. That 'use utf8' arguably works counter-intuitively. Since
my code is ASCII and all ASCII is automatically utf8, I tend to wonder
why I would ever write non-ascii code. It may not be a logical thing
to do but I daresay it is an instinctive thing to do. Now if I want to
dabble in utf8 or databases, what do I do? I think of 'use utf8' or
'use DataBaseInterface DBI'.

What I needed was 'use Encode' which is what I am doing now.
Thanks for all the responses.

Peter J. Holzer · Jan 4, 2012

That perl (5.8.8) is very nearly six years old. You should
upgrade to at least 5.12. [...]
If you don't 'use utf8', perl assumes your source is ISO8859-1. If
you do, it assumes your source is in UTF-8. (In theory you can use other
encodings with the 'use encoding' pragma, but AIUI this doesn't work
reliably.)
...
What did you expect to happen? perldoc utf8 quite clearly says
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.
so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
must expect warnings and misbehaviour.

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. For example, if your
script was encoded in ISO-8859-1, "ä" would result in a string consisting
of a single byte with the value 0xE4, but that byte is not equivalent to
the character "ä" - it doesn't match \w, [:lower:] or any of the other
classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
uppercased. It is just a meaningless byte, not a character.
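The difference between the byte and the character can be made visible with a few lines. This is a sketch that assumes a perl running with its default settings: no 'use locale', no 'use utf8', and no 'unicode_strings' feature (which 'use v5.12' or later would switch on). The variable names are invented; decode comes from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What a Latin-1 encoded source yields for "ä" without 'use utf8': one byte.
my $byte = "\xE4";

# The same byte interpreted as ISO-8859-1: LATIN SMALL LETTER A WITH DIAERESIS.
my $char = decode('ISO-8859-1', $byte);

my $byte_is_word = $byte =~ /\w/ ? 1 : 0;
my $char_is_word = $char =~ /\w/ ? 1 : 0;

my $byte_uc_changed = uc($byte) ne $byte ? 1 : 0;
my $char_uc_changed = uc($char) ne $char ? 1 : 0;

print "byte: matches \\w = $byte_is_word, uc changes it = $byte_uc_changed\n";
print "char: matches \\w = $char_is_word, uc changes it = $char_uc_changed\n";
```

Under those assumptions the undecoded byte neither matches \w nor changes under uc, while the decoded character does both.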

That 'use utf8' arguably works counter-intuitively. Since
my code is ASCII

No, your code isn't ASCII. It contained the line

| $str = "ä" ;

"ä" is not an ASCII character.

and all ASCII is automatically utf8, I tend to wonder
why I would ever write non-ascii code.

Well, why did you?

What I needed was 'use Encode' which is what I am doing now.

Please don't unless you really understand what it does. Encode does a
couple of different things and it isn't entirely consistent. It seemed
like a good idea at the time and it may have been useful for converting
pre-5.8-code, but I really wouldn't use it for new code.

hp

Rainer Weikusat · Jan 4, 2012

Peter J. Holzer said:
That perl (5.8.8) is very nearly six years old. You should
upgrade to at least 5.12. [...]
If you don't 'use utf8', perl assumes your source is ISO8859-1. If
you do, it assumes your source is in UTF-8. (In theory you can use other
encodings with the 'use encoding' pragma, but AIUI this doesn't work
reliably.)
...
What did you expect to happen? perldoc utf8 quite clearly says
Do not use this pragma for anything else than telling Perl that your
script is written in UTF-8.
so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
must expect warnings and misbehaviour.

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1.
The character codes are the same, but the semantics are different.

This is also not quite correct: When 'use locale' is in effect, Perl
assumes that anything beyond ASCII is supposed to have a meaning in
the locale which happens to be in effect when the script is
executed. Otherwise, the default is equivalent to the default POSIX
locale (corresponding with LANG=C) which means bytes with value in the
range (0, 127) will be interpreted as ASCII characters belonging to
some of the different character classes and bytes with values from
(128, 255) are just 'bytes with certain values' and no further
properties.

Eg, assuming the text included below

----------------
$a = chr(0xe4);

{
use locale;
print 'locale: ', $a =~ /\w/, "\n";
}

print 'no locale: ', $a =~ /\w/, "\n";
----------------

is saved to a file on a system where locale-information for ISO-8859-1
based German is available, the command (a.pl being the name of the
file)

LANG=de_DE perl a.pl

will print

locale: 1
no locale:

and

LANG=C perl a.pl

will print

locale:
no locale:

Rainer Weikusat · Jan 5, 2012

Ben Morrow said:
Quoth "Peter J. Holzer":

On 2012-01-04 07:38, (e-mail address removed) wrote:

[...]

Please don't unless you really understand what it does. Encode does a
couple of different things and it isn't entirely consistent. It seemed
like a good idea at the time and it may have been useful for converting
pre-5.8-code, but I really wouldn't use it for new code.

Are you (either of you, in fact) thinking of 'use encoding'? That pragma
is, as I said originally, a Bad Idea.

This would then be another documented Perl feature which managed to run
afoul of someone's opinions. Is there actually any other reason than "it's a
convenient way to do what shalt not be done"?

Rainer Weikusat · Jan 5, 2012

Shmuel (Seymour J.) Metz said:
No.

It's documented:

[rw@sapphire]/tmp $whatis encoding
encoding (3perl) - allows you to write your script in non-ascii or non-utf8

But according to the opinion of someone, it shouldn't be used.

Yes.

And - as usual - no reasons beyond 'thou shalt do as I bid you and not
ask silly questions' are given.

Peter J. Holzer · Jan 5, 2012

Quoth "Peter J. Holzer":

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. For example, if your
script was encoded in ISO-8859-1, "ä" would result in a string consisting
of a single byte with the value 0xE4, but that byte is not equivalent to
the character "ä" - it doesn't match \w, [:lower:] or any of the other
classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
uppercased. It is just a meaningless byte, not a character.

Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
string doesn't match \w, but as soon as you do anything that causes it
to be upgraded, it will.

Yup, but unless you do something which causes it to be upgraded, it
won't. So if you care about it being ISO-8859-1, you have to either
force an upgrade or decode it. So I prefer to think of it as an
"unspecified superset of ASCII" and not as "almost but not quite
ISO-8859-1".
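The upgrade effect can be reproduced directly with utf8::upgrade, which changes only the internal storage of a string. This is a sketch assuming default settings (no 'use locale', no 'unicode_strings' feature); the variable names are invented.

```perl
use strict;
use warnings;

my $plain    = "\xE4";
my $upgraded = "\xE4";
utf8::upgrade($upgraded);    # same characters, different internal storage

# The two strings compare equal ...
my $same_string = $plain eq $upgraded ? 1 : 0;

# ... but the regex engine treats them differently.
my $plain_is_word    = $plain    =~ /\w/ ? 1 : 0;
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

print "equal: $same_string\n";
print "plain matches \\w:    $plain_is_word\n";
print "upgraded matches \\w: $upgraded_is_word\n";
```

Two strings that are equal under eq give different answers for \w, which is why code that cares about ISO-8859-1 semantics has to force the upgrade or decode explicitly rather than rely on it happening by accident.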

Are you (either of you, in fact) thinking of 'use encoding'?

Yes, sorry. I misread what (e-mail address removed) wrote.

That pragma is, as I said originally, a Bad Idea. Encode, OTOH, is
perfectly reliable, and cannot be avoided if you want to use data in
any encoding other than UTF-8.

Right. I use that quite frequently, actually.

hp

Rainer Weikusat · Jan 5, 2012

Peter J. Holzer said:
Quoth "Peter J. Holzer":

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. For example, if your
script was encoded in ISO-8859-1, "ä" would result in a string consisting
of a single byte with the value 0xE4, but that byte is not equivalent to
the character "ä" - it doesn't match \w, [:lower:] or any of the other
classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
uppercased. It is just a meaningless byte, not a character.

Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
string doesn't match \w, but as soon as you do anything that causes it
to be upgraded, it will.

Yup, but unless you do something which causes it to be upgraded, it
won't. So if you care about it being ISO-8859-1, you have to either
force an upgrade or decode it. So I prefer to think of it as an
"unspecified superset of ASCII"

Assuming that locale information isn't being used, it is ASCII and not
'a superset of ASCII' since no byte value outside the subset of
possible byte values used by the ASCII encoding has any 'character
properties' (except being considered 'a non-word character', that is).

Peter J. Holzer · Jan 6, 2012

Peter J. Holzer said:
Peter J. Holzer said:

Quoth "Peter J. Holzer":

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. [...]
Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
string doesn't match \w, but as soon as you do anything that causes it
to be upgraded, it will.

Yup, but unless you do something which causes it to be upgraded, it
won't. So if you care about it being ISO-8859-1, you have to either
force an upgrade or decode it. So I prefer to think of it as an
"unspecified superset of ASCII"

Assuming that locale information isn't being used, it is ASCII and not
'a superset of ASCII' since no byte value outside the subset of
possible byte values used by the ASCII encoding has any 'character
properties' (except being considered 'a non-word character', that is).

ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
isn't ASCII any more.

hp

Rainer Weikusat · Jan 6, 2012

Peter J. Holzer said:
Peter J. Holzer said:

Quoth "Peter J. Holzer":

It is very useful to know that perl assumes the source to be
ISO8859-1.

This is not quite correct. Without 'use utf8', perl assumes your source
is an unspecified superset of ASCII, not ISO-8859-1. The character codes
are the same, but the semantics are different. [...]
Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
string doesn't match \w, but as soon as you do anything that causes it
to be upgraded, it will.

Yup, but unless you do something which causes it to be upgraded, it
won't. So if you care about it being ISO-8859-1, you have to either
force an upgrade or decode it. So I prefer to think of it as an
"unspecified superset of ASCII"

Assuming that locale information isn't being used, it is ASCII and not
'a superset of ASCII' since no byte value outside the subset of
possible byte values used by the ASCII encoding has any 'character
properties' (except being considered 'a non-word character', that is).

ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
isn't ASCII any more.

As soon as a byte with value > 127 is considered to be some character,
it isn't ASCII anymore but a superset of ASCII.

Rainer Weikusat · Jan 6, 2012

Shmuel (Seymour J.) Metz said:
What's documented: the feature, or the claim that the only reason
not to use it is that it ran afoul of someone's opinion?

Try an educated guess based on the content of the text I wrote. I'm
giving you a gratis hint: I quoted the 'name' section of the encoding
manual page.

[...]

Well, the opinions of the people who tried to fix the bugs.

The mere fact that somebody failed at doing something ('tried to fix a
bug') doesn't really make that someone authoritative on anything.

And, as usual, you are inventing claims that nobody actually made.

That claim was implicit in your refusal to give a reason.

Peter J. Holzer · Jan 6, 2012

As soon as a byte with value > 127 is considered to be some character,
it isn't ASCII anymore but a superset of ASCII.

Glad you agree.

hp

Rainer Weikusat · Jan 7, 2012

Peter J. Holzer said:
Glad you agree.

I don't: Since bytes with values > 127 are not considered to be
characters, it doesn't make sense to refer to this as 'superset of
ASCII': It would need some character outside of the ASCII range, not
just numbers which can also be stored in bytes because of the
hardware.

Kaz Kylheku · Jan 7, 2012

I don't: Since bytes with values > 127 are not considered to be
characters

That's not grounds to disagree. The union of the set of all bicycles and the
set of ASCII characters is a superset of the set of ASCII characters.

Also, the set of ASCII characters is a superset of the set of ASCII
characters (albeit not a proper superset).

Ted Zlatanov · Jan 7, 2012

BM> I am not authoritative on anything. I have never claimed to be. I am
BM> attempting to convey what I believe was the consensus on p5p, in the
BM> hope that people here might find the information useful.

Your information was useful and practical, thanks.

BM> You are free to ignore me. I, and probably everyone else, would very
BM> much rather you did.

Yeah, killfiling Rainer is not enough. Get into his killfile like I did.

Ted

Rainer Weikusat · Jan 8, 2012

Shmuel (Seymour J.) Metz said:
I did; you don't seem to like my educated guess.

I think you came up with a rather idiotic supposition based on your
somewhat lacking 'reading comprehension' skills, your desire to attack
me in any case and your complete inability (or unwillingness) to come
up with something like a rational counterargument.

I suggest that you consider discussing the issue with an interested
parking meter.
