Is the pod of Encode::MIME::Header giving wrong advice?

Discussion in 'Perl Misc' started by G.B., Apr 23, 2014.

  1. G.B.

    G.B. Guest

    Hi,

    From the following lines in the docs of Encode::MIME::Header,

    $utf8 = decode('MIME-Header', $header);
    $header = encode('MIME-Header', $utf8);

    one might be tempted to infer that $utf8 stands for UTF-8
    encoded "text", i.e. bytes.

    Apparently, it doesn't.

    Proof: when Encode::encode('MIME-Header', $perlstring) is called with
    a regular character string rather than UTF-8 encoded bytes, perl
    prints, as expected:

    $ perl -e 'use Encode;
    my $perlstring = "A \x{20AC} is worth \$1.35";
    print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";
    '
    =?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=

    The triple E2 82 AC is the UTF-8 triple of '€', as expected. QED.
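
    For what it's worth, here is a minimal sketch of the round trip. The
    variable names ($text, $header, $decoded) are mine, not the POD's. It only
    checks that a character string goes in and a character string comes out;
    the exact shape of the encoded header can differ between Encode versions,
    so no output is shown.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A character string (decoded text), not UTF-8 encoded bytes.
my $text = "A \x{20AC} is worth \$1.35";

# encode() turns the characters into an ASCII-only header value ...
my $header = encode('MIME-Header', $text);

# ... and decode() turns that header value back into characters.
my $decoded = decode('MIME-Header', $header);

printf "header is ASCII only: %s\n",
    ($header =~ /\A[\x00-\x7F]*\z/ ? 'yes' : 'no');
printf "round trip preserved the text: %s\n",
    ($decoded eq $text ? 'yes' : 'no');
printf "decoded length in characters: %d\n", length($decoded);
```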

    Given the "utf8 flag fallacy", if you'll allow me to call it
    that, do the above two lines from the pod really give a good hint?

    TIA, Georg
     
    G.B., Apr 23, 2014
    #1

  2. G.B. wrote:
    ^^^^
    Please fix. And something has gone wrong with MIME-encode()ing your “From”
    header field value, too, because I do not see an address there.
    *Any* text is encoded in bytes. UTF-8 just requires more than one 8-bit
    byte for Unicode characters whose code point value is greater than 0x7F.
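
    To make the byte counts concrete, here is a small sketch that encodes
    single characters to UTF-8 and reports how many bytes each one needs. The
    list of code points and the %len hash are illustrative choices only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my %len;    # code point => number of UTF-8 bytes

for my $cp (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0x20AC, 0x1F600) {
    # chr() yields one character; encode() yields its UTF-8 bytes.
    my $bytes = encode('UTF-8', chr($cp));
    $len{$cp} = length($bytes);
    printf "U+%04X -> %d byte(s): %s\n",
        $cp, $len{$cp}, join(' ', unpack('(H2)*', $bytes));
}
```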

    What are you talking about?

    $ perl -We 'my $perlstring = "A \x{20AC} is worth \$1.35"; print
    $perlstring, "\n";' | od -ctx1
    Wide character in print at -e line 1.
    0000000   A     342 202 254       i   s       w   o   r   t   h       $
             41  20  e2  82  ac  20  69  73  20  77  6f  72  74  68  20  24
    0000020   1   .   3   5  \n
             31  2e  33  35  0a
    0000025
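
    The "Wide character" warning appears because print is handed a character
    above 0xFF on a handle that has no encoding layer. A sketch of how one
    might avoid it, assuming UTF-8 output is what is wanted; $bytes and $hex
    are illustrative names:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $perlstring = "A \x{20AC} is worth \$1.35";

# Encode explicitly so that print receives bytes, not wide characters.
my $bytes = encode('UTF-8', $perlstring);
print $bytes, "\n";

# Alternative: put an encoding layer on the handle and print characters.
# binmode(STDOUT, ':encoding(UTF-8)');
# print $perlstring, "\n";

# Hex dump of the encoded bytes, for comparison with the od output.
my $hex = join ' ', unpack('(H2)*', $bytes);
print "$hex\n";
```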

    $ perl -v

    This is perl 5, version 18, subversion 2 (v5.18.2) built for i486-linux-gnu-thread-multi-64int
    (with 40 registered patches, see perl -V for more detail)

    Copyright 1987-2013, Larry Wall

    Perl may be copied only under the terms of either the Artistic License or
    the GNU General Public License, which may be found in the Perl 5 source kit.

    Complete documentation for Perl, including FAQ lists, should be found on
    this system using "man perl" or "perldoc perl". If you have access to the
    Internet, point your browser at http://www.perl.org/, the Perl Home Page.

    $ locale
    LANG=de_CH.UTF-8
    LANGUAGE=
    LC_CTYPE="de_CH.UTF-8"
    LC_NUMERIC="de_CH.UTF-8"
    LC_TIME="de_CH.UTF-8"
    LC_COLLATE="de_CH.UTF-8"
    LC_MONETARY="de_CH.UTF-8"
    LC_MESSAGES=en_US.UTF-8
    LC_PAPER="de_CH.UTF-8"
    LC_NAME="de_CH.UTF-8"
    LC_ADDRESS="de_CH.UTF-8"
    LC_TELEPHONE="de_CH.UTF-8"
    LC_MEASUREMENT="de_CH.UTF-8"
    LC_IDENTIFICATION="de_CH.UTF-8"
    LC_ALL=
     
    Thomas 'PointedEars' Lahn, Apr 23, 2014
    #2

  3.  
    Thomas 'PointedEars' Lahn, Apr 23, 2014
    #3
  4. G.B.

    G.B. Guest

    That's tautologically true by definition, though not addressing
    the issue.

    The line addressing it was

    Encode::encode("MIME-Header", "... \x{20AC} ...")

    The line, that is, which indicates that the parameter, here written

    "... \x{20AC} ..."

    must *not* have been returned from Encode::encode("UTF-8", ...)
    *even* *though* the documentation names the parameter "$utf8".
    (Which is ubiquitously implying "bytes, not text" in Perl/Python
    /Ruby, to users of equally ubiquitous packages requiring
    one or the other.)
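
    A sketch of what seems to go wrong when the argument has already been
    through Encode::encode("UTF-8", ...). The names $chars and $bytes are
    illustrative. The comparison goes through decode() rather than looking at
    the raw header, because the exact header text may vary between Encode
    versions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{20AC}";                 # one character: EURO SIGN
my $bytes = encode('UTF-8', $chars);    # three bytes: E2 82 AC

# Passing characters: the round trip gives the original character back.
my $from_chars = decode('MIME-Header', encode('MIME-Header', $chars));

# Passing already-encoded bytes: each byte is taken as a character
# (U+00E2, U+0082, U+00AC) and gets encoded a second time.
my $from_bytes = decode('MIME-Header', encode('MIME-Header', $bytes));

printf "from characters: %d character(s)\n", length($from_chars);
printf "from bytes:      %d character(s)\n", length($from_bytes);
```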

    With all the discussion in Perl's docs related to Unicode, it seems
    pretty obvious that the variable $utf8 in Encode::MIME::Header's docs
    is expressing an expectation. Viz., that the programmer is expected
    to understand that the thing named "$utf8" is so named for a reason.
    That it is to be of a certain kind.
    That kind has nothing to do with explicit UTF-8 encoding, however,
    that's the issue. The issue is the expectation that the name "$utf8"
    is generating. (Add code points in the Latin-1 range, source encoding,
    web form data, data from SQL tables, Stack Overflow questions about
    encoding etc., then you get the overly rich set of possible
    interpretations.)


    __
    Please fix USENET caused spam, etc. first, if simple inference
    of real names from obviously available information is asking
    too much of you, Spock, which I cannot imagine to be the case!
    Georg.
     
    G.B., Apr 24, 2014
    #4
  5. You talked nonsense about bytes *vs.* text; that required correction.
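    The code that you posted does not contain either call. It contained

    print STDOUT Encode::encode("MIME-Q", $perlstring), "\n";

    which had printed, by your account and in my tests,

    =?UTF-8?Q?A=20=E2=82=AC=20is=20worth=20=241=2E35?=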

    which is (basically) the Quoted-Printable representation of the string with
    UTF-8 as its “charset” value (RFC 2047, §4.2). It makes sense to use UTF-8
    for the “charset” here because Unicode is AFAIK the only character set
    featuring that code point value. Also, UTF-8 provides, on average, the
    shortest possible encoding for Unicode characters [1] – here: three 8-bit
    bytes – and is well supported by user agents. So apparently it works as
    designed. (It appears to be independent of the locale, too.)
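
    As an illustration of that structure, here is a hand-rolled sketch that
    builds such an encoded-word. This is not how Encode::MIME::Header is
    implemented. It escapes every non-alphanumeric byte (more than RFC 2047
    requires) and ignores the 75-character limit on encoded-words. The names
    $text, $payload, $word and $roundtrip are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "A \x{20AC} is worth \$1.35";

# Step 1: characters -> bytes in the chosen charset (UTF-8 here).
my $bytes = encode('UTF-8', $text);

# Step 2: "Q" encoding, done bluntly: every byte that is not an ASCII
# letter or digit becomes "=XX" with XX in upper-case hex.
(my $payload = $bytes) =~ s/([^A-Za-z0-9])/sprintf('=%02X', ord($1))/ge;

# Step 3: wrap it as =?charset?encoding?payload?=
my $word = "=?UTF-8?Q?$payload?=";
print "$word\n";

# Cross-check: let Encode decode the hand-built encoded-word.
my $roundtrip = decode('MIME-Header', $word);
printf "decodes back to the original text: %s\n",
    ($roundtrip eq $text ? 'yes' : 'no');
```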

    Utter nonsense.
    BTDT. When will you start?
    An e-mail address, if it is an address (the thing in your From header field
    value is _not_, because “a mailbox *receives* mail”; see RFC 5322, §3.4.1,
    referenced by RFC 5536, §3.1.2) is just an *address*. Inferring from that the
    real name and other personal information is fallacious, as you can see e.g.
    in my From header field value. Likewise for domain names.
    You are demanding diligence, even fallacious inference, in communication
    from others (the POD writers, me), but you yourself do not even care to
    observe basic rules of communication. Hypocrite.

    [en] <http://www.interhack.net/pubs/munging-harmful/>
    [de] <http://www.gerlo.de/falsche-email-adressen.html>
    [de] <http://www.arcor.de/rd/doc.agb_vf>

    You have been warned.

    *PLONK*
     
    Thomas 'PointedEars' Lahn, Apr 24, 2014
    #5
  6. G.B.

    G.B. Guest

    On 24.04.14 12:21, Thomas 'PointedEars' Lahn wrote:

    $header = encode('MIME-Header', $utf8);
    It's about how the input variable "$utf8" is misnamed,
    thus misleading.

    For bytes vs. strings, I'm referring to

    $ man perlunicode
    $ man MIME::Base64
     
    G.B., Apr 24, 2014
    #6