Part of RFC 822 ignored by email module

Bob Kline · Jan 20, 2011

I just noticed that the following passage in RFC 822:

The process of moving from this folded multiple-line
representation of a header field to its single line represen-
tation is called "unfolding". Unfolding is accomplished by
regarding CRLF immediately followed by a LWSP-char as
equivalent to the LWSP-char.

is not being honored by the email module. The following two invocations
of message_from_string() should return the same value, but that's not
what happens:
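
>>> import email
>>> email.message_from_string("Subject: blah").get('SUBJECT')
'blah'
>>> email.message_from_string("Subject:\n blah").get('SUBJECT')
' blah'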

Note the space in front of the second value returned, but missing from
the first. Can someone convince me that this is not a bug?

Carl Banks · Jan 20, 2011

> I just noticed that the following passage in RFC 822:
>
> The process of moving from this folded multiple-line
> representation of a header field to its single line represen-
> tation is called "unfolding". Unfolding is accomplished by
> regarding CRLF immediately followed by a LWSP-char as
> equivalent to the LWSP-char.
>
> is not being honored by the email module. The following two invocations
> of message_from_string() should return the same value, but that's not
> what happens:
>
> >>> import email
> >>> email.message_from_string("Subject: blah").get('SUBJECT')
> 'blah'
> >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
> ' blah'
>
> Note the space in front of the second value returned, but missing from
> the first. Can someone convince me that this is not a bug?

That's correct, according to my reading of RFC 822 (I doubt it's
changed so I didn't bother to look up what the latest RFC on that
subject is.)

The RFC says that in a folded line the whitespace on the following
line is considered a part of the line. Relevant quote (section
3.1.1):

Each header field can be viewed as a single, logical line of
ASCII characters, comprising a field-name and a field-body.
For convenience, the field-body portion of this conceptual
entity can be split into a multiple-line representation; this
is called "folding". The general rule is that wherever there
may be linear-white-space (NOT simply LWSP-chars), a CRLF
immediately followed by AT LEAST one LWSP-char may instead be
inserted. Thus, the single line

To: "Joe & J. Harvey" <ddd @Org>, JJV @ BBN

can be represented as:

To: "Joe & J. Harvey" <ddd @ Org>,
        JJV@BBN

and

To: "Joe & J. Harvey"
        <ddd@ Org>, JJV
 @BBN

and

To: "Joe &
 J. Harvey" <ddd @ Org>, JJV @ BBN

The process of moving from this folded multiple-line
representation of a header field to its single line represen-
tation is called "unfolding". Unfolding is accomplished by
regarding CRLF immediately followed by a LWSP-char as
equivalent to the LWSP-char.

Carl Banks
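
For reference, the unfolding rule quoted above can be sketched in a few lines of Python. This is only an illustration of the RFC text, not how the email module is implemented. The `unfold` name is made up for this example, and accepting a bare LF in addition to CRLF is an assumption made for leniency.

```python
import re

# A line break (CRLF, or a bare LF as a lenient assumption) that is
# immediately followed by a space or tab. The lookahead leaves the
# whitespace character itself in place.
_FOLD = re.compile(r"\r?\n(?=[ \t])")


def unfold(raw_header: str) -> str:
    """Hypothetical helper: drop each fold's line break, keep its LWSP-char."""
    return _FOLD.sub("", raw_header)


if __name__ == "__main__":
    folded = 'To: "Joe &\r\n J. Harvey" <ddd @ Org>, JJV @ BBN'
    print(unfold(folded))
```

Under this rule "Subject:\r\n blah" and "Subject: blah" unfold to the same raw line, which is the equivalence the RFC passage describes.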

Bob Kline · Jan 20, 2011

> That's correct, according to my reading of RFC 822 (I doubt it's
> changed so I didn't bother to look up what the latest RFC on that
> subject is.)
>
> The RFC says that in a folded line the whitespace on the following
> line is considered a part of the line.

Thanks for responding. I think your interpretation of the RFC is the
same as mine. What I'm saying is that by not returning the same value
in the two cases above the module is not "regarding CRLF immediately
followed by a LWSP-char as equivalent to the LWSP-char."

Martin Gregorie · Jan 20, 2011

> Thanks for responding. I think your interpretation of the RFC is the
> same as mine. What I'm saying is that by not returning the same value
> in the two cases above the module is not "regarding CRLF immediately
> followed by a LWSP-char as equivalent to the LWSP-char."

That's only a problem if your code cares about the composition of the
whitespace and this, IMO, is incorrect behaviour. When the separator
between syntactic elements in a header is 'whitespace' it should not
matter what combination of newlines, tabs and spaces make up the
whitespace element.

Bob Kline · Jan 20, 2011

> That's only a problem if your code cares about the composition of the
> whitespace and this, IMO, is incorrect behaviour. When the separator
> between syntactic elements in a header is 'whitespace' it should not
> matter what combination of newlines, tabs and spaces make up the
> whitespace element.

That would be true for what the RFC calls "structured" fields, but not
for the others (such as the Subject header).

Martin Gregorie · Jan 20, 2011

> That would be true for what the RFC calls "structured" fields, but not
> for the others (such as the Subject header).

Subject text comparisons should work correctly if you were to split the
subject text using the 'whitespace' definition and then reassemble it
using a single space in place of each whitespace separator. It's either
that or assuming that all MUAs use the same line length and all use a
line split of "CRLF " - the whitespace that's needed to align the
continuation with the text on the first subject line. Many MUAs will do
that, but it's unlikely that all will.
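
The split-and-reassemble approach described above might look like this in Python. It is a minimal sketch: `normalize_subject` is a name invented for the example, and treating every run of whitespace (including CR and LF) as one separator is exactly the lossy choice being proposed, so the result is suitable for comparisons only.

```python
def normalize_subject(value: str) -> str:
    """Hypothetical helper: collapse every whitespace run to one space.

    str.split() with no arguments splits on any run of whitespace and
    discards leading and trailing whitespace, so folded and unfolded
    forms of the same subject compare equal afterwards.
    """
    return " ".join(value.split())


if __name__ == "__main__":
    print(normalize_subject("blah") == normalize_subject(" blah"))
    print(normalize_subject("Re:  a long\r\n\tsubject line"))
```

The original spacing cannot be recovered from the normalized value.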

Bob Kline · Jan 20, 2011

> Subject text comparisons should work correctly if you were to split the
> subject text using the 'whitespace' definition and then reassemble it
> using a single space in place of each whitespace separator. It's either
> that or assuming that all MUAs use the same line length and all use a
> line split of "CRLF " - the whitespace that's needed to align the
> continuation with the text on the first subject line. Many MUAs will do
> that, but it's unlikely that all will.

Thanks. I'm not sure everyone would agree that it's OK to collapse
multiple consecutive spaces into one, but I'm beginning to suspect that
those more concerned with preserving as much as possible of the original
message are in the minority. It sounds like my take-home distillation
from this thread is "yes, the module ignores what the spec says about
unfolding, but it doesn't matter." I guess I can live with that.

Martin Gregorie · Jan 20, 2011

> Thanks. I'm not sure everyone would agree that it's OK to collapse
> multiple consecutive spaces into one, but I'm beginning to suspect that
> those more concerned with preserving as much as possible of the original
> message are in the minority. It sounds like my take-home distillation
> from this thread is "yes, the module ignores what the spec says about
> unfolding, but it doesn't matter." I guess I can live with that.

I've been doing stuff in this area with the JavaMail package, though not
as yet in Python. I've learnt that when you parse the headers you can
extract values that work well for comparisons, as database keys, etc. but
are not guaranteed to let you reconstitute the original header byte for
byte. If preserving the message exactly as received matters, the solution
is to parse the message to extract the headers and MIME parts you need for
the application to carry out its function, but keep the original, unparsed
message so you can pass it on.
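
In Python, that "parse for keys, keep the original" pattern could be sketched roughly as follows. `StoredMessage` and `store` are hypothetical names for this example, and the whitespace normalization of the subject is an assumption about what makes a usable comparison key, not something the email module does for you.

```python
from dataclasses import dataclass
from email import message_from_string


@dataclass(frozen=True)
class StoredMessage:
    raw: str          # the message exactly as received, never modified
    subject_key: str  # derived value, for comparisons and lookups only


def store(raw: str) -> StoredMessage:
    """Hypothetical helper: derive a key from the parsed headers, keep raw."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    # Collapse whitespace so folded and unfolded subjects give the same key.
    return StoredMessage(raw=raw, subject_key=" ".join(subject.split()))


if __name__ == "__main__":
    folded = store("Subject:\n blah\n\nbody\n")
    plain = store("Subject: blah\n\nbody\n")
    print(folded.subject_key == plain.subject_key)
    print(folded.raw != plain.raw)
```

The application works with `subject_key`; anything that has to be forwarded byte for byte comes from `raw`.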

The other gotcha is assuming that the MUA author read and understood the
RFCs. Very many barely glanced at RFCs and/or misunderstood them.
Consequently, if you use strict parsing you'll be surprised how many
messages get rejected for having invalid headers or MIME headers. For
instance, the mistakes some MUAs make when outputting To, CC and BCC
headers with multiple addresses have to be seen to be believed. If the
Python e-mail module lets you, set it to use lenient parsing. If this
isn't an option you may well find yourself having to fix up messages
before you can parse them successfully.
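
On the lenient versus strict question, here is a rough sketch assuming Python 3.3 or later, where `email.message_from_string()` accepts a `policy` keyword. As far as I know the parser is lenient unless told otherwise and records problems in `msg.defects`, while `email.policy.strict` raises on the first defect. The `BROKEN` sample and the two wrapper functions are made up for illustration.

```python
import email
import email.policy
from email import errors

# Hypothetical malformed input: the second line is neither a header,
# a continuation line, nor the blank header/body separator.
BROKEN = "Subject: hello\nthis line has no colon\n\nbody\n"


def parse_lenient(text):
    """Parse without raising; problems are collected in msg.defects."""
    return email.message_from_string(text, policy=email.policy.default)


def parse_strict(text):
    """Parse with email.policy.strict, which raises on a defect."""
    return email.message_from_string(text, policy=email.policy.strict)


if __name__ == "__main__":
    msg = parse_lenient(BROKEN)
    print(msg["Subject"], msg.defects)
    try:
        parse_strict(BROKEN)
    except errors.MessageDefect as exc:
        print("strict parsing rejected the message:", type(exc).__name__)
```

With the lenient form you can inspect `msg.defects` afterwards and decide per message whether a fix-up is needed.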

Carl Banks · Jan 21, 2011

> Thanks for responding. I think your interpretation of the RFC is the
> same as mine. What I'm saying is that by not returning the same value
> in the two cases above the module is not "regarding CRLF immediately
> followed by a LWSP-char as equivalent to the LWSP-char."

That makes sense. The space after \n is part of the reconstructed
subject and the email module should have treated it the same as if the
line hadn't been folded. I agree that it's a bug. The line unfolding
needs to be moved earlier in the parse process.

Carl Banks
