How do I decode unicode characters in the subject using email.message_from_string()?

Roy H. Han

Dear python-list,

I'm having some trouble decoding an email header using the standard
imaplib.IMAP4 class and email.message_from_string method.

In particular, email.message_from_string() does not seem to properly
decode unicode characters in the subject.

How do I decode unicode characters in the subject?

I read in the documentation that the email module supports RFC 2047.
But is there a way to make imaplib.IMAP4 and email.message_from_string
use this protocol? I'm using Python 2.5.2 on Fedora. Perhaps this
problem has been fixed already in Python 2.6.

RHH
 
John Machin

Roy H. Han said:
Dear python-list,

I'm having some trouble decoding an email header using the standard
imaplib.IMAP4 class and email.message_from_string method.

In particular, email.message_from_string() does not seem to properly
decode unicode characters in the subject.

How do I decode unicode characters in the subject?

You don't. You can't. You decode str objects into unicode objects. You
encode unicode objects into str objects. If your input is not a str
object, you have a problem.
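
For readers on Python 3, the same rule holds with different type names: `bytes` plays the role that `str` plays in John's Python 2 description, and `str` plays the role of `unicode`. A minimal sketch, assuming Python 3 and an ISO-8859-1 byte string chosen purely for illustration:

```python
# Python 3 naming: bytes is the encoded form, str is the decoded (text) form.
raw = b"caf\xe9"                  # illustrative bytes, encoded as ISO-8859-1
text = raw.decode("iso-8859-1")   # decode: bytes -> text
wire = text.encode("utf-8")       # encode: text -> bytes, in another encoding

print(repr(text))
print(repr(wire))
```

Decoding only ever goes from bytes to text, and encoding from text to bytes, which is the point John is making about the question's wording.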

I'm no expert on the email package, but experts don't have crystal
balls, so let's gather some data for them while we're waiting for
their timezones to align:

Presumably your code is doing something like:
msg = email.message_from_string(a_string)

Please report the results of
print repr(a_string)
and
print type(msg)
print msg.items()
and tell us what you expected.

Cheers,
John
 
rdmurray

John Machin said:
You don't. You can't. You decode str objects into unicode objects. You
encode unicode objects into str objects. If your input is not a str
object, you have a problem.

I can't speak for the OP, but I had a similar (and possibly
identical-in-intent) question. Suppose you have a Subject line that
looks like this:

Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

How do you get the email module to decode that into unicode? The same
question applies to the other header lines, and the answer is it isn't
easy, and I had to read and reread the docs and experiment for a while
to figure it out. I understand there's going to be a sprint on the
email module at pycon, maybe some of this will get improved then.

Here's the final version of my test program. The third to last line is
one I thought ought to work given that Header has a __unicode__ method.
The final line is the one that did work (note the kludge to turn None
into 'ascii'...IMO 'ascii' is what decode_header _should_ be returning,
and this code shows why!)

-------------------------------------------------------------------
from email import message_from_string
from email.header import Header, decode_header

x = message_from_string("""\
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.
""")

print x
print "--------------------"
for key, header in x.items():
    print key, 'type', type(header)
    print key+":", unicode(Header(header)).decode('utf-8')
    print key+":", decode_header(header)
    print key+":", ''.join([s.decode(t or 'ascii') for (s, t) in decode_header(header)]).encode('utf-8')
-------------------------------------------------------------------


From nobody Wed Feb 25 08:35:29 2009
To: test
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?=
        =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=

this is a test.

--------------------
To type <type 'str'>
To: test
To: [('test', None)]
To: test
Subject type <type 'str'>
Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?= =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=
Subject: [("'u' Obselete type", None), ("-- it is identical to 'd'. (7)", 'iso-8859-1')]
Subject: 'u' Obselete type-- it is identical to 'd'. (7)
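
The program above is Python 2. As a rough Python 3 counterpart, the sketch below wraps the same idea (decode each chunk, treating a charset of None as ASCII) in a helper. The name `decode_mime_header` is made up for this example, and the fallback for unknown charset labels is an assumption rather than anything the email package prescribes:

```python
from email import message_from_string
from email.header import decode_header


def decode_mime_header(raw_value):
    """Return a header value as text, decoding any RFC 2047 encoded-words."""
    pieces = []
    for data, charset in decode_header(raw_value):
        if isinstance(data, bytes):
            # charset is None for unencoded runs; treat those as ASCII.
            try:
                pieces.append(data.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label (assumed fallback): decode as ASCII.
                pieces.append(data.decode("ascii", errors="replace"))
        else:
            # With no encoded-words at all, decode_header returns the str as-is.
            pieces.append(data)
    return "".join(pieces)


msg = message_from_string(
    "To: test\n"
    "Subject: 'u' Obselete type =?ISO-8859-1?Q?--_it_is_identical_?="
    " =?ISO-8859-1?Q?to_=27d=27=2E_=287=29?=\n"
    "\n"
    "this is a test.\n"
)
for name, value in msg.items():
    print(name + ":", decode_mime_header(value))
```

Python 3 also offers `str(make_header(decode_header(value)))` from `email.header`, which may be enough when no custom error handling is needed.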


--RDM
 
Roy H. Han

Thanks for writing back, RDM and John Machin. Tomorrow I'll try the
code you suggested, RDM. It looks quite helpful and I'll report the
results.

In the meantime, John asked for more data. The sender's email client
is Microsoft Outlook 11. The recipient email client is Lotus Notes.



Actual Subject
=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=

Expected Subject
Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records

X-Mailer
Microsoft Office Outlook 11

X-MimeOLE
Produced By Microsoft MimeOLE V6.00.2900.5579



RHH



Steve Holden

Roy said:
Actual Subject
=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=

Expected Subject
Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR Records

X-Mailer
Microsoft Office Outlook 11

X-MimeOLE
Produced By Microsoft MimeOLE V6.00.2900.5579

>>> from email.header import decode_header
>>> print decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
[('Inteum C/SR User Tip:  Quick Access to Recently Opened Inteum C/SR Records', 'us-ascii')]
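
Steve's call can be reproduced on Python 3 as well. This is only a sketch, assuming Python 3, where decode_header hands back bytes chunks that still need decoding; the variable names are illustrative:

```python
from email.header import decode_header

# Roy's folded Subject value: two encoded-words separated by CRLF + TAB.
folded = (
    "=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?="
    "\r\n\t=?us-ascii?Q?m_C/SR_Records?="
)

chunks = decode_header(folded)
subject = "".join(
    data.decode(charset or "ascii") if isinstance(data, bytes) else data
    for data, charset in chunks
)
print(subject)
```

The fold (CRLF plus tab) between the two encoded-words does not appear in the result, so "Inteu" and "m" are joined back into "Inteum".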
regards
Steve
 
rdmurray

Steve Holden said:
decode_header("=?us-ascii?Q?Inteum_C/SR_User_Tip:__Quick_Access_to_Recently_Opened_Inteu?=\r\n\t=?us-ascii?Q?m_C/SR_Records?=")
[('Inteum C/SR User Tip: Quick Access to Recently Opened Inteum C/SR
Records', 'us-ascii')]

It is interesting that decode_header does what I would consider to be
the right thing (from a pragmatic standpoint) with that particular bit
of Microsoft not-quite-standards-compliant brain-damage; but, removing
the tab is not in fact standards compliant if I'm reading the RFC
correctly.

--RDM
 
Roy H. Han

Cool, it works!

Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.
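
In later Python 3 releases the parser can do this decoding itself if it is given a non-default policy. The sketch below assumes Python 3.6 or newer and uses a shortened, single-line version of Roy's subject as illustrative input:

```python
from email import message_from_string, policy

raw = (
    "To: test\n"
    "Subject: =?us-ascii?Q?Inteum_C/SR_User_Tip?=\n"
    "\n"
    "this is a test.\n"
)

# Default (compat32) policy: header values come back still RFC 2047 encoded.
legacy = message_from_string(raw)
# policy.default: header values are decoded to text when accessed.
modern = message_from_string(raw, policy=policy.default)

print(repr(legacy["Subject"]))
print(repr(str(modern["Subject"])))
```

With the default policy the behaviour matches what Roy observed: the raw encoded-word is returned and decode_header has to be called by hand.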


Steve Holden

rdmurray said:
It is interesting that decode_header does what I would consider to be
the right thing (from a pragmatic standpoint) with that particular bit
of Microsoft not-quite-standards-compliant brain-damage; but, removing
the tab is not in fact standards compliant if I'm reading the RFC
correctly.

You'd need to quote me chapter and verse on that. I understood that the
tab simply indicated continuation, but it's a *long* time since I read
the RFCs.

regards
Steve
 
Thorsten Kampe

* Roy H. Han (Wed, 25 Feb 2009 10:17:22 -0500)
Thanks, RDM, for stating the right approach.
Thanks, Steve, for teaching by example.

I wonder why the email.message_from_string() method doesn't call
email.header.decode_header() automatically.

And I wonder why you would think the header contains Unicode characters
when it says "us-ascii" ("=?us-ascii?Q?"). I think there is a tendency
to label everything "Unicode" someone does not understand.

Thorsten
 
Gabriel Genellina

On Wed, 25 Feb 2009 13:40:31 -0200, Thorsten Kampe wrote:
And I wonder why you would think the header contains Unicode characters
when it says "us-ascii" ("=?us-ascii?Q?"). I think there is a tendency
to label everything "Unicode" someone does not understand.

And I wonder why you would think the header does *not* contain Unicode
characters when it says "us-ascii"? I think there is a tendency here
too...
 
Thorsten Kampe

* Gabriel Genellina (Wed, 25 Feb 2009 14:00:16 -0200)
And I wonder why you would think the header does *not* contain Unicode
characters when it says "us-ascii"?

Basically because it didn't contain any Unicode characters (anything
outside the ASCII range).

Thorsten
 
Tim Golden

Thorsten said:
Basically because it didn't contain any Unicode characters (anything
outside the ASCII range).

And I imagine that Gabriel's point was -- and my point certainly
is -- that Unicode includes all the characters *inside* the
ASCII range.
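
Tim's statement can be checked directly. A small sketch, assuming Python 3 and using part of Roy's decoded subject as sample text:

```python
subject = "Inteum C/SR User Tip"

# Every character is a Unicode code point below 128, i.e. inside ASCII.
assert all(ord(ch) < 128 for ch in subject)

# For such text the ASCII and UTF-8 encodings produce identical bytes,
# and ASCII-encoded bytes decode cleanly as UTF-8.
assert subject.encode("ascii") == subject.encode("utf-8")
assert subject.encode("ascii").decode("utf-8") == subject

print("all code points < 128:", all(ord(ch) < 128 for ch in subject))
```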


TJG
 
Gabriel Genellina

On Wed, 25 Feb 2009 15:01:08 -0200, Thorsten Kampe wrote:
Basically because it didn't contain any Unicode characters (anything
outside the ASCII range).

I think you have to revise your definition of "Unicode".
 
rdmurray

Steve Holden said:
You'd need to quote me chapter and verse on that. I understood that the
tab simply indicated continuation, but it's a *long* time since I read
the RFCs.

Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character. Header folding (insertion of <cr><lf>) can
occur most places whitespace appears, and is defined in section
2.2.3 thusly:

   Each header field is logically a single line of characters comprising
   the field name, the colon, and the field body. For convenience
   however, and to deal with the 998/78 character limitations per line,
   the field body portion of a header field can be split into a multiple
   line representation; this is called "folding". The general rule is
   that wherever this standard allows for folding white space (not
   simply WSP characters), a CRLF may be inserted before any WSP. For
   example, the header field:

           Subject: This is a test

   can be represented as:

           Subject: This
            is a test

   [irrelevant note elided]

   The process of moving from this folded multiple-line representation
   of a header field to its single line representation is called
   "unfolding". Unfolding is accomplished by simply removing any CRLF
   that is immediately followed by WSP. Each header field should be
   treated in its unfolded form for further syntactic and semantic
   evaluation.

So, the whitespace characters are supposed to be left unchanged
after unfolding.
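
The unfolding rule quoted above is small enough to sketch in a few lines. This assumes Python 3 and CRLF line endings as on the wire; the function name `unfold` is made up for this example and is not part of the email package:

```python
import re


def unfold(header_text):
    """Remove every CRLF that is immediately followed by a space or tab.

    The whitespace character itself is kept, as RFC 2822 section 2.2.3 says.
    """
    return re.sub(r"\r\n(?=[ \t])", "", header_text)


print(repr(unfold("Subject: This\r\n is a test")))
print(repr(unfold("Subject: first\r\n\tsecond")))
```

Separately, RFC 2047 (section 6.2) says that when a header is displayed, linear white space separating two adjacent encoded-words is ignored, which may be why decode_header drops the tab between the two encoded-words in Roy's subject.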

--David
 
Steve Holden

rdmurray said:
So, the whitespace characters are supposed to be left unchanged
after unfolding.

That would certainly appear to be the case. Thanks.

regards
Steve
 
Thorsten Kampe

* Tim Golden (Wed, 25 Feb 2009 17:27:07 +0000)
And I imagine that Gabriel's point was -- and my point certainly
is -- that Unicode includes all the characters *inside* the
ASCII range.

I know that this was Gabriel's point. And my point was that Gabriel's
point was pointless. If you call any text (or character) "Unicode" then
the word "Unicode" is generalized to an extent where it doesn't mean
anything at all anymore and becomes a buzzword.

By the same reasoning you could call ASCII a Unicode encoding (which it
isn't) because all ASCII characters are Unicode characters (code
points). Only encodings that cover the full Unicode range can reasonably
be called Unicode encodings.

The OP just saw some "weird characters" in the email subject and thought
"I know. It looks weird. Must be Unicode". But it wasn't. It was good
ole ASCII - only Quoted Printable encoded.


Thorsten
 
Gabriel Genellina

rdmurray said:
Tab is not mentioned in RFC 2822 except to say that it is a valid
whitespace character. Header folding (insertion of <cr><lf>) can
occur most places whitespace appears, and is defined in section
2.2.3 thusly: [...]
So, the whitespace characters are supposed to be left unchanged
after unfolding.

Yep, there is an old bug report sleeping in the tracker about this...
 
Gabriel Genellina

On Wed, 25 Feb 2009 16:19:35 -0200, Thorsten Kampe wrote:
I know that this was Gabriel's point. And my point was that Gabriel's
point was pointless. If you call any text (or character) "Unicode" then
the word "Unicode" is generalized to an extent where it doesn't mean
anything at all anymore and becomes a buzzword.

If it's text, it should use Unicode. Maybe not now, but in a few years it
will be totally unacceptable not to properly use Unicode to process
textual data.

Thorsten Kampe wrote:
By the same reasoning you could call ASCII a Unicode encoding (which it
isn't) because all ASCII characters are Unicode characters (code
points). Only encodings that cover the full Unicode range can reasonably
be called Unicode encodings.

Not at all. ASCII is as valid a character encoding ("coded character set",
as the Unicode guys like to say) as ISO 10646 (which covers the whole
range).

Thorsten Kampe wrote:
The OP just saw some "weird characters" in the email subject and thought
"I know. It looks weird. Must be Unicode". But it wasn't. It was good
ole ASCII - only Quoted Printable encoded.

Good f*cked ASCII is Unicode too.
 
