Is the pod of Encode::MIME::Header giving wrong advice?

G.B. · Apr 23, 2014

Hi,

From the following lines in the docs of Encode::MIME::Header,

$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);

one might be tempted to infer that $utf8 stands for UTF-8
encoded "text", i.e. bytes.

Apparently, it doesn't.

Proof: calling Encode::encode('MIME-Header', $perlstring), i.e.
not passing some UTF-8 encoded bytes, but passing a regular
character string instead, perl prints, as expected:

$ perl -e 'use Encode;
my $perlstring = "A \x{20AC} is worth \$1.35";
print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";
'
=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=

The triple E2 82 AC is the UTF-8 triple of '€', as expected. QED.
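
For illustration, here is a minimal sketch of the difference. It is not taken from the pod: the names $chars and $bytes are made up for this example, and the exact Q-encoded output may differ between Encode versions. It passes the same text to encode() once as a character string and once as already UTF-8 encoded bytes:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: one element per Unicode code point.
my $chars = "A \x{20AC} is worth \$1.35";

# The same text as UTF-8 encoded bytes (the euro sign becomes E2 82 AC).
my $bytes = encode('UTF-8', $chars);

# Header built from characters: the euro sign shows up as =E2=82=AC.
my $from_chars = encode('MIME-Q', $chars);

# Header built from bytes: each byte is taken as a Latin-1 character
# and encoded again, so E2 becomes C3 A2 and so on.
my $from_bytes = encode('MIME-Q', $bytes);

print "from characters: $from_chars\n";
print "from bytes:      $from_bytes\n";

# Only the character-string variant round-trips to the original text.
print "round trip ok\n" if decode('MIME-Header', $from_chars) eq $chars;
```

If the second header contains C3=A2 where the first has E2, the encoder has treated its input as characters both times, so handing it UTF-8 bytes encodes them a second time.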

Given the "utf8 flag fallacy", if you'll allow me to call it
that, do the above two lines from the pod really give a good hint?

TIA, Georg

Thomas 'PointedEars' Lahn · Apr 23, 2014

G.B. wrote:
^^^^
Please fix. And something has gone wrong with MIME-encode()ing your “From”
header field value, too, because I do not see an address there.

From the following lines in the docs of Encode::MIME::Header,

$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);

one might be tempted to infer that $utf8 stands for UTF-8
encoded "text", i.e. bytes.

*Any* text is encoded in bytes. UTF-8 just requires more than one 8-bit
byte for Unicode characters whose code point value is 0x7F.

Apparently, it doesn't.

Proof: calling Encode::encode('MIME-Header', $perlstring), i.e.
not passing some UTF-8 encoded bytes, but passing a regular
character string instead, perl prints, as expected:

$ perl -e 'use Encode;
my $perlstring = "A \x{20AC} is worth \$1.35";
print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";
'
=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=

The triple E2 82 AC is the UTF-8 triple of '€', as expected. QED.

Given the "utf8 flag fallacy", if you'll allow me to call it
that, do the above two lines from the pod really give a good hint?

What are you talking about?

$ perl -We 'my $perlstring = "A \x{20AC} is worth \$1.35"; print
$perlstring, "\n";' | od -ctx1
Wide character in print at -e line 1.
0000000   A     342 202 254       i   s       w   o   r   t   h       $
         41  20  e2  82  ac  20  69  73  20  77  6f  72  74  68  20  24
0000020   1   .   3   5  \n
         31  2e  33  35  0a
0000025

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for i486-linux-gnu-thread-multi-64int
(with 40 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or
the GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

$ locale
LANG=de_CH.UTF-8
LANGUAGE=
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=

Thomas 'PointedEars' Lahn · Apr 23, 2014

Thomas said:
*Any* text is encoded in bytes. UTF-8 just requires more than one 8-bit
byte for Unicode characters whose code point value is 0x7F.

Correction: that should read "whose code point value is above 0x7F".
<http://unicode.org/faq/>

G.B. · Apr 24, 2014

*Any* text is encoded in bytes.

That's tautologically true by definition, though not addressing
the issue.

The line addressing it was

Encode::encode("MIME-Header", "... \x{20AC} ...")

The line, that is, which indicates that the parameter, here written

"... \x{20AC} ..."

must *not* have been returned from Encode::encode("UTF-8", ...)
*even* *though* the documentation names the parameter "$utf8".
(Which is ubiquitously implying "bytes, not text" in Perl/Python
/Ruby, to users of equally ubiquitous packages requiring
one or the other.)

With all the discussion in Perl's docs related to Unicode, it seems
pretty obvious that the variable $utf8 in Encode::MIME::Header's docs
is expressing an expectation. Viz., that the programmer is expected
to understand that the thing named "$utf8" is so named for a reason.
That it is to be of a certain kind.
That kind has nothing to do with explicit UTF-8 encoding, however,
that's the issue. The issue is the expectation that the name "$utf8"
is generating. (Add code points in the Latin-1 range, source encoding,
web form data, data from SQL tables, Stack Overflow questions about
encoding etc., then you get the overly rich set of possible
interpretations.)
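
To make the point concrete, here is a small sketch of what decode('MIME-Header', ...) hands back. The names $header, $text and $octets are illustrative only and do not come from the pod. The value the pod calls $utf8 behaves as a character string, and UTF-8 bytes appear only after an explicit encode('UTF-8', ...):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A header value carrying a single euro sign as a Q-encoded word.
my $header = '=?UTF-8?Q?=E2=82=AC?=';

# decode() yields characters: the euro sign is one element, U+20AC.
my $text = decode('MIME-Header', $header);
printf "characters: %d, code point: U+%04X\n", length($text), ord($text);

# Only an explicit encode step turns it into UTF-8 bytes.
my $octets = encode('UTF-8', $text);
printf "UTF-8 octets: %d\n", length($octets);
```

So a name such as $text or $string would describe the decoded value more accurately than $utf8 does.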

__
Please fix USENET caused spam, etc. first, if simple inference
of real names from obviously available information is asking
too much of you, Spock, which I cannot imagine to be the case!
Georg.

Thomas 'PointedEars' Lahn · Apr 24, 2014

G.B. said:
That's tautologically true by definition, though not addressing
the issue.

You talked nonsense about bytes *vs.* text; that required correction.

The line addressing it was

Encode::encode("MIME-Header", "... \x{20AC} ...")

The line, that is, which indicates that the parameter, here written

"... \x{20AC} ..."

must *not* have been returned from Encode::encode("UTF-8", ...)

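The code that you posted does not contain either call. It contained

print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";

which had printed, by your account and in my tests,

=?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=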

which is (basically) the Quoted-Printable representation of the string with
UTF-8 as its “charset” value (RFC 2047, §4.2). It makes sense to use UTF-8
for the “charset” here because Unicode is AFAIK the only character set
featuring that code point value. Also, UTF-8 provides, on average, the
shortest possible encoding for Unicode characters [1] – here: three 8-bit
bytes – and is well supported by user agents. So apparently it works as
designed. (It appears to be independent of the locale, too.)

(Which is ubiquitously implying "bytes, not text" in Perl/Python
/Ruby, to users of equally ubiquitous packages requiring
one or the other.)

Utter nonsense.

[…]
__
Please fix USENET caused spam, etc. first,

BTDT. When will you start?

if simple inference of real names from obviously available information

An e-mail address, if it is an address (the thing in your From header field
value is _not_, because “a mailbox *receives* mail”; see RFC 5322, §3.4.1,
referred by RFC 5536, §3.1.2) is just an *address*. Inferring from that the
real name and other personal information is fallacious, as you can see e.g.
in my From header field value. Likewise for domain names.

is asking too much of you, Spock, which I cannot imagine to be the case!
Georg.

You are demanding diligence, even fallacious inference, in communication
from others (the POD writers, me), but you yourself do not even care to
observe basic rules of communication. Hypocrite.

[en] <http://www.interhack.net/pubs/munging-harmful/>
[de] <http://www.gerlo.de/falsche-email-adressen.html>
[de] <http://www.arcor.de/rd/doc.agb_vf>

You have been warned.

*PLONK*

G.B. · Apr 24, 2014

On 24.04.14 12:21, Thomas 'PointedEars' Lahn wrote:

$header = encode('MIME-Header', $utf8);

which had printed, by your account and in my tests,

which is (basically) the Quoted-Printable representation of the string with
UTF-8 as its “charset” value (RFC 2047, §4.2).

It's about how the input variable "$utf8" is misnamed,
thus misleading.

For bytes vs. strings, I'm referring to

$ man perlunicode
$ man MIME::Base64
