Is the pod of Encode::MIME::Header giving wrong advice?


G.B.

Hi,

From the following lines in the docs of Encode::MIME::Header,

$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);

one might be tempted to infer that $utf8 stands for UTF-8
encoded "text", i.e. bytes.

Apparently, it doesn't.

Proof: calling Encode::encode('MIME-Header', $perlstring), i.e.
not passing some UTF-8 encoded bytes, but passing a regular
character string instead, perl prints, as expected:

$ perl -e 'use Encode;
my $perlstring = "A \x{20AC} is worth \$1.35";
print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";
'
=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=

The triple E2 82 AC is the UTF-8 triple of '€', as expected. QED.

Given the "utf8 flag fallacy", if you'll allow me to call it
that, do the above two lines from the pod really give a good hint?

TIA, Georg
 
Thomas 'PointedEars' Lahn

G.B. wrote:
^^^^
Please fix. And something has gone wrong with MIME-encode()ing your “From”
header field value, too, because I do not see an address there.
From the following lines in the docs of Encode::MIME::Header,

$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);

one might be tempted to infer that $utf8 stands for UTF-8
encoded "text", i.e. bytes.

*Any* text is encoded in bytes. UTF-8 just requires more than one 8-bit
byte for Unicode characters whose code point value is greater than 0x7F.

Apparently, it doesn't.

Proof: calling Encode::encode('MIME-Header', $perlstring), i.e.
not passing some UTF-8 encoded bytes, but passing a regular
character string instead, perl prints, as expected:

$ perl -e 'use Encode;
my $perlstring = "A \x{20AC} is worth \$1.35";
print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";
'
=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=

The triple E2 82 AC is the UTF-8 triple of '€', as expected. QED.

Given the "utf8 flag fallacy", if you'll allow me to call it
that, do the above two lines from the pod really give a good hint?

What are you talking about?

$ perl -We 'my $perlstring = "A \x{20AC} is worth \$1.35"; print
$perlstring, "\n";' | od -ctx1
Wide character in print at -e line 1.
0000000   A     342 202 254       i   s       w   o   r   t   h       $
         41  20  e2  82  ac  20  69  73  20  77  6f  72  74  68  20  24
0000020   1   .   3   5  \n
         31  2e  33  35  0a
0000025

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for i486-linux-gnu-thread-multi-64int
(with 40 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or
the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

$ locale
LANG=de_CH.UTF-8
LANGUAGE=
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
 
G.B.

*Any* text is encoded in bytes.

That's tautologically true by definition, though not addressing
the issue.

The line addressing it was

Encode::encode("MIME-Header", "... \x{20AC} ...")

The line, that is, which indicates that the parameter, here written

"... \x{20AC} ..."

must *not* have been returned from Encode::encode("UTF-8", ...)
*even* *though* the documentation names the parameter "$utf8".
(Which is ubiquitously implying "bytes, not text" in Perl/Python
/Ruby, to users of equally ubiquitous packages requiring
one or the other.)

With all the discussion in Perl's docs related to Unicode, it seems
pretty obvious that the variable $utf8 in Encode::MIME::Header's docs
is expressing an expectation. Viz., that the programmer is expected
to understand that the thing named "$utf8" is so named for a reason.
That it is to be of a certain kind.
That kind has nothing to do with explicit UTF-8 encoding, however,
and that's the issue: the expectation that the name "$utf8"
is generating. (Add code points in the Latin-1 range, source encoding,
web form data, data from SQL tables, Stack Overflow questions about
encoding etc., and you get an overly rich set of possible
interpretations.)


__
Please fix USENET caused spam, etc. first, if simple inference
of real names from obviously available information is asking
too much of you, Spock, which I cannot imagine to be the case!
Georg.
 
Thomas 'PointedEars' Lahn

G.B. said:
That's tautologically true by definition, though not addressing
the issue.

You talked nonsense about bytes *vs.* text; that required correction.
The line addressing it was

Encode::encode("MIME-Header", "... \x{20AC} ...")

The line, that is, which indicates that the parameter, here written

"... \x{20AC} ..."

must *not* have been returned from Encode::encode("UTF-8", ...)

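The code that you posted does not contain either call. It contained

print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";

which had printed, by your account and in my tests,

=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=
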
which is (basically) the Quoted-Printable representation of the string with
UTF-8 as its “charset” value (RFC 2047, §4.2). It makes sense to use UTF-8
for the “charset” here because Unicode is AFAIK the only character set
featuring that code point value. Also, UTF-8 provides, on average, the
shortest possible encoding for Unicode characters [1] – here: three 8-bit
bytes – and is well supported by user agents. So apparently it works as
designed. (It appears to be independent of the locale, too.)

(Which is ubiquitously implying "bytes, not text" in Perl/Python
/Ruby, to users of equally ubiquitous packages requiring
one or the other.)

Utter nonsense.
[…]
__
Please fix USENET caused spam, etc. first,

BTDT. When will you start?
if simple inference of real names from obviously available information

An e-mail address, if it is an address (the thing in your From header field
value is _not_, because “a mailbox *receives* mail”; see RFC 5322, §3.4.1,
referred by RFC 5536, §3.1.2) is just an *address*. Inferring from that the
real name and other personal information is fallacious, as you can see e. g.
in my From header field value. Likewise for domain names.
is asking too much of you, Spock, which I cannot imagine to be the case!
Georg.

You are demanding diligence, even fallacious inference, in communication
from others (the POD writers, me), but you yourself do not even care to
observe basic rules of communication. Hypocrite.

[en] <http://www.interhack.net/pubs/munging-harmful/>
[de] <http://www.gerlo.de/falsche-email-adressen.html>
[de] <http://www.arcor.de/rd/doc.agb_vf>

You have been warned.

*PLONK*
 
G.B.

On 24.04.14 12:21, Thomas 'PointedEars' Lahn wrote:

$header = encode('MIME-Header', $utf8);
which had printed, by your account and in my tests,

which is (basically) the Quoted-Printable representation of the string with
UTF-8 as its “charset” value (RFC 2047, §4.2).

It's about how the input variable "$utf8" is misnamed,
thus misleading.

For bytes vs. strings, I'm referring to

$ man perlunicode
$ man MIME::Base64
 