Flatten an email Message with a non-ASCII body using 8bit CTE

W. Trevor King

Hello list!

I'm trying to figure out how to flatten a MIMEText message to bytes
using an 8bit Content-Transfer-Encoding in Python 3.3. Here's what
I've tried so far:

# -*- encoding: utf-8 -*-
import email.encoders
from email.charset import Charset
from email.generator import BytesGenerator
from email.mime.text import MIMEText
import sys

body = 'Ζεύς'
encoding = 'utf-8'
charset = Charset(encoding)
charset.body_encoding = email.encoders.encode_7or8bit

message = MIMEText(body, 'plain', encoding)
del message['Content-Transfer-Encoding']
message.set_payload(body, charset)
try:
    BytesGenerator(sys.stdout.buffer).flatten(message)
except UnicodeEncodeError as e:
    print('error with string input:')
    print(e)

message = MIMEText(body, 'plain', encoding)
del message['Content-Transfer-Encoding']
message.set_payload(body.encode(encoding), charset)
try:
    BytesGenerator(sys.stdout.buffer).flatten(message)
except TypeError as e:
    print('error with byte input:')
    print(e)

The `del m[…]; m.set_payload()` bits work around #16324 [1] and should
be orthogonal to the encoding issues. It's possible that #12553 is
trying to address this issue [2,3], but that issue's comments are a
bit vague, so I'm not sure.

The problem with the string payload is that
email.generator.BytesGenerator.write is getting the Unicode string
payload unencoded and trying to encode it as ASCII. It may be
possible to work around this by encoding the payload so that anything
that doesn't encode (using the body charset) to a 7bit value is
replaced with a surrogate escape, but I'm not sure how to do that.
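
One way the surrogate-escape idea might be sketched: encode the body with
its charset, then decode those bytes as ASCII with the `surrogateescape`
error handler, so that every non-ASCII byte becomes a lone surrogate which
BytesGenerator turns back into the original byte on output. This is a
minimal sketch, not a confirmed fix. It builds a bare `Message` with
hand-set headers rather than a MIMEText, the helper name `flatten_8bit` is
made up for illustration, and it assumes an ASCII-compatible charset such
as UTF-8 and a Python version whose BytesGenerator passes surrogate-escaped
payloads through under the default 8bit policy.

```python
import io
from email.generator import BytesGenerator
from email.message import Message


def flatten_8bit(body, encoding='utf-8'):
    """Hypothetical helper: flatten `body` to bytes with an 8bit CTE."""
    raw = body.encode(encoding)
    # Bytes >= 0x80 become lone surrogates (U+DC80..U+DCFF); ASCII bytes
    # decode to themselves.
    escaped = raw.decode('ascii', 'surrogateescape')
    message = Message()
    message['Content-Type'] = 'text/plain; charset="{}"'.format(encoding)
    message['Content-Transfer-Encoding'] = '8bit'
    # No charset argument, so the payload is stored without re-encoding.
    message.set_payload(escaped)
    buffer = io.BytesIO()
    BytesGenerator(buffer).flatten(message)
    return buffer.getvalue()


flattened = flatten_8bit('Ζεύς')
```

The generator splits the payload on line endings before writing, so a
charset like UTF-16-LE, where a 0x0A byte can be half of a character,
would presumably need different handling.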

The problem with the byte payload is that _has_surrogates (used in
email.generator.Generator._handle_text and
BytesGenerator._handle_text) chokes on byte input:

TypeError: can't use a string pattern on a bytes-like object
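
The same failure can be reproduced without the email package, since a
compiled string pattern rejects bytes input. The pattern below is only a
stand-in made up for illustration; it is not the actual `_has_surrogates`
implementation, and the exact wording of the error message may differ
between Python versions.

```python
import re

# Hypothetical stand-in for a str-based surrogate check.
surrogate_pattern = re.compile(r'[\udc80-\udcff]')

raw = 'Ζεύς'.encode('utf-8')
escaped = raw.decode('ascii', 'surrogateescape')

# A surrogate-escaped str is searched without trouble.
found_in_str = surrogate_pattern.search(escaped) is not None

# Passing the raw bytes to the same str pattern raises TypeError.
try:
    surrogate_pattern.search(raw)
    bytes_error = None
except TypeError as e:
    bytes_error = e
```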

For UTF-8, you can get away with:

message.as_string().encode(message.get_charset().get_output_charset())

because the headers are encoded into 7 bits, so re-encoding them with
UTF-8 is a no-op. However, if the body charset is UTF-16-LE or any
other encoding that remaps 7bit characters, this hack breaks down.
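
A small check of that claim, using a sample header line chosen only for
illustration: re-encoding ASCII text as UTF-8 leaves the bytes unchanged,
while UTF-16-LE widens every character to two bytes, so the headers would
no longer be valid 7bit header bytes.

```python
header = 'Content-Transfer-Encoding: 8bit\n'

# UTF-8 is ASCII-compatible, so re-encoding 7bit header text is a no-op.
utf8_unchanged = header.encode('utf-8') == header.encode('ascii')

# UTF-16-LE uses two bytes per ASCII character, so the header bytes change.
utf16_header = header.encode('utf-16-le')
utf16_unchanged = utf16_header == header.encode('ascii')
```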

Thoughts?
Trevor

[1]: http://bugs.python.org/issue16324
[2]: http://bugs.python.org/issue12553
[3]: http://bugs.python.org/issue12552#msg140294
