Part of RFC 822 ignored by email module

Discussion in 'Python' started by Bob Kline, Jan 20, 2011.

  1. Bob Kline

    Bob Kline Guest

    I just noticed that the following passage in RFC 822:

    The process of moving from this folded multiple-line
    representation of a header field to its single line represen-
    tation is called "unfolding". Unfolding is accomplished by
    regarding CRLF immediately followed by a LWSP-char as
    equivalent to the LWSP-char.

    is not being honored by the email module. The following two invocations
    of message_from_string() should return the same value, but that's not
    what happens:

    >>> import email
    >>> email.message_from_string("Subject: blah").get('SUBJECT')
    'blah'
    >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
    ' blah'

    Note the space in front of the second value returned, but missing from
    the first. Can someone convince me that this is not a bug?

    --
    Bob Kline
    http://www.rksystems.com
     
    Bob Kline, Jan 20, 2011
    #1

  2. Carl Banks

    Carl Banks Guest

    On Jan 20, 7:08 am, Bob Kline wrote:
    > I just noticed that the following passage in RFC 822:
    >
    >          The process of moving  from  this  folded   multiple-line
    >          representation  of a header field to its single line represen-
    >          tation is called "unfolding".  Unfolding  is  accomplished  by
    >          regarding   CRLF   immediately  followed  by  a  LWSP-char  as
    >          equivalent to the LWSP-char.
    >
    > is not being honored by the email module.  The following two invocations
    > of message_from_string() should return the same value, but that's not
    > what happens:
    >
    >  >>> import email
    >  >>> email.message_from_string("Subject: blah").get('SUBJECT')
    > 'blah'
    >  >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
    > ' blah'
    >
    > Note the space in front of the second value returned, but missing from
    > the first.  Can someone convince me that this is not a bug?


    That's correct, according to my reading of RFC 822 (I doubt it's
    changed so I didn't bother to look up what the latest RFC on that
    subject is.)

    The RFC says that in a folded line the whitespace on the following
    line is considered a part of the line. Relevant quote (section
    3.1.1):


    Each header field can be viewed as a single, logical line of
    ASCII characters, comprising a field-name and a field-body.
    For convenience, the field-body portion of this conceptual
    entity can be split into a multiple-line representation; this
    is called "folding". The general rule is that wherever there
    may be linear-white-space (NOT simply LWSP-chars), a CRLF
    immediately followed by AT LEAST one LWSP-char may instead be
    inserted. Thus, the single line

    To: "Joe & J. Harvey" <ddd @Org>, JJV @ BBN

    can be represented as:

    To: "Joe & J. Harvey" <ddd @ Org>,
            JJV@BBN

    and

    To: "Joe & J. Harvey"
                    <ddd@ Org>, JJV
     @BBN

    and

    To: "Joe &
     J. Harvey" <ddd @ Org>, JJV @ BBN

    The process of moving from this folded multiple-line
    representation of a header field to its single line represen-
    tation is called "unfolding". Unfolding is accomplished by
    regarding CRLF immediately followed by a LWSP-char as
    equivalent to the LWSP-char.


    Carl Banks
     
    Carl Banks, Jan 20, 2011
    #2

  3. Bob Kline

    Bob Kline Guest

    On 1/20/2011 12:23 PM, Carl Banks wrote:
    > On Jan 20, 7:08 am, Bob Kline wrote:
    >> I just noticed that the following passage in RFC 822:
    >>
    >> The process of moving from this folded multiple-line
    >> representation of a header field to its single line represen-
    >> tation is called "unfolding". Unfolding is accomplished by
    >> regarding CRLF immediately followed by a LWSP-char as
    >> equivalent to the LWSP-char.
    >>
    >> is not being honored by the email module. The following two invocations
    >> of message_from_string() should return the same value, but that's not
    >> what happens:
    >>
    >> >>> import email
    >> >>> email.message_from_string("Subject: blah").get('SUBJECT')
    >> 'blah'
    >> >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
    >> ' blah'
    >>
    >> Note the space in front of the second value returned, but missing from
    >> the first. Can someone convince me that this is not a bug?

    > That's correct, according to my reading of RFC 822 (I doubt it's
    > changed so I didn't bother to look up what the latest RFC on that
    > subject is.)
    >
    > The RFC says that in a folded line the whitespace on the following
    > line is considered a part of the line.


    Thanks for responding. I think your interpretation of the RFC is the
    same as mine. What I'm saying is that by not returning the same value
    in the two cases above, the module is not "regarding CRLF immediately
    followed by a LWSP-char as equivalent to the LWSP-char."
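
To make the quoted rule concrete, here is a minimal sketch of RFC 822 unfolding in Python. The `unfold` helper is hypothetical (it is not part of the email module), and accepting a bare `\n` as well as `\r\n` is an assumption made only because the example in this thread uses `\n`.

```python
import re

# RFC 822 defines LWSP-char as SPACE or HTAB.
# Assumption: a bare "\n" is accepted in addition to "\r\n",
# because the example in this thread folds with "\n".
_FOLD = re.compile(r"\r?\n(?=[ \t])")


def unfold(raw_header: str) -> str:
    """Drop every line break that is immediately followed by a space or tab.

    The whitespace character itself is kept, so the line break plus
    LWSP-char becomes equivalent to the LWSP-char alone.
    """
    return _FOLD.sub("", raw_header)


if __name__ == "__main__":
    single = "Subject: blah"
    folded = "Subject:\n blah"
    print(unfold(single) == unfold(folded))  # True
```

Under this reading both inputs unfold to the same logical line, which is the equivalence the passage describes.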

    --
    Bob Kline
    http://www.rksystems.com
     
    Bob Kline, Jan 20, 2011
    #3
  4. Martin Gregorie

    Martin Gregorie Guest

    On Thu, 20 Jan 2011 12:55:44 -0500, Bob Kline wrote:

    > On 1/20/2011 12:23 PM, Carl Banks wrote:
    >> On Jan 20, 7:08 am, Bob Kline wrote:
    >>> I just noticed that the following passage in RFC 822:
    >>>
    >>> The process of moving from this folded multiple-line
    >>> representation of a header field to its single line
    >>> representation is called "unfolding". Unfolding is
    >>> accomplished by regarding CRLF immediately followed
    >>> by a LWSP-char as equivalent to the LWSP-char.
    >>>
    >>> is not being honored by the email module. The following two
    >>> invocations of message_from_string() should return the same value, but
    >>> that's not what happens:
    >>>
    >>> >>> import email
    >>> >>> email.message_from_string("Subject: blah").get('SUBJECT')
    >>> 'blah'
    >>> >>> email.message_from_string("Subject:\n blah").get('SUBJECT')
    >>> ' blah'
    >>>
    >>> Note the space in front of the second value returned, but missing from
    >>> the first. Can someone convince me that this is not a bug?

    >> That's correct, according to my reading of RFC 822 (I doubt it's
    >> changed so I didn't bother to look up what the latest RFC on that
    >> subject is.)
    >>
    >> The RFC says that in a folded line the whitespace on the following line
    >> is considered a part of the line.

    >
    > Thanks for responding. I think your interpretation of the RFC is the
    > same as mine. What I'm saying is that by not returning the same value
    > in the two cases above the module is not "regarding CRLF immediately
    > followed by a LWSP-char as equivalent to the LWSP-char."
    >

    That's only a problem if your code cares about the composition of the
    whitespace, and caring about that is, IMO, incorrect behaviour. When the
    separator between syntactic elements in a header is 'whitespace' it
    should not matter what combination of newlines, tabs and spaces make up
    the whitespace element.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jan 20, 2011
    #4
  5. Bob Kline

    Bob Kline Guest

    On 1/20/2011 3:48 PM, Martin Gregorie wrote:
    > That's only a problem if your code cares about the composition of the
    > whitespace and this, IMO is incorrect behaviour. When the separator
    > between syntactic elements in a header is 'whitespace' it should not
    > matter what combination of newlines, tabs and spaces make up the
    > whitespace element.


    That would be true for what the RFC calls "structured" fields, but not
    for the others (such as the Subject header).

    --
    Bob Kline
    http://www.rksystems.com
     
    Bob Kline, Jan 20, 2011
    #5
  6. Martin Gregorie

    Martin Gregorie Guest

    On Thu, 20 Jan 2011 16:25:52 -0500, Bob Kline wrote:

    > On 1/20/2011 3:48 PM, Martin Gregorie wrote:
    >> That's only a problem if your code cares about the composition of the
    >> whitespace and this, IMO is incorrect behaviour. When the separator
    >> between syntactic elements in a header is 'whitespace' it should not
    >> matter what combination of newlines, tabs and spaces make up the
    >> whitespace element.

    >
    > That would be true for what the RFC calls "structured" fields, but not
    > for the others (such as the Subject header).


    Subject text comparisons should work correctly if you were to split the
    subject text using the 'whitespace' definition and then reassemble it
    using a single space in place of each whitespace separator. It's either
    that or assuming that all MUAs use the same line length and all use a
    line split of "CRLF " - the whitespace that's needed to align the
    continuation with the text on the first subject line. Many MUAs will do
    that, but it's unlikely that all will.
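
The split-and-reassemble idea can be sketched in a couple of lines of Python. The name `normalize_ws` is made up for this example, and note that `str.split()` with no arguments splits on every kind of whitespace Python knows about, which is broader than the RFC's SPACE and HTAB.

```python
def normalize_ws(value: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends.

    Intended for comparing header values only: the result does not
    preserve the original spacing.
    """
    # str.split() with no argument splits on any whitespace run,
    # including "\r\n" left over from folding.
    return " ".join(value.split())


if __name__ == "__main__":
    print(normalize_ws("blah") == normalize_ws(" blah"))  # True
```

This gives stable keys for comparison at the cost of discarding runs of multiple spaces.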


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jan 20, 2011
    #6
  7. Bob Kline

    Bob Kline Guest

    On 1/20/2011 5:34 PM, Martin Gregorie wrote:
    > On Thu, 20 Jan 2011 16:25:52 -0500, Bob Kline wrote:
    >
    >> On 1/20/2011 3:48 PM, Martin Gregorie wrote:
    >>> That's only a problem if your code cares about the composition of the
    >>> whitespace and this, IMO is incorrect behaviour. When the separator
    >>> between syntactic elements in a header is 'whitespace' it should not
    >>> matter what combination of newlines, tabs and spaces make up the
    >>> whitespace element.

    >> That would be true for what the RFC calls "structured" fields, but not
    >> for the others (such as the Subject header).

    > Subject text comparisons should work correctly if you were to split the
    > subject text using the 'whitespace' definition and then reassemble it
    > using a single space in place of each whitespace separator. It's either
    > that or assuming that all MUAs use the same line length and all use a
    > line split of "CRLF " - the whitespace that's needed to align the
    > continuation with the text on the first subject line. Many MUAs will do
    > that, but it's unlikely that all will.


    Thanks. I'm not sure everyone would agree that it's OK to collapse
    multiple consecutive spaces into one, but I'm beginning to suspect that
    those more concerned with preserving as much as possible of the original
    message are in the minority. It sounds like my take-home distillation
    from this thread is "yes, the module ignores what the spec says about
    unfolding, but it doesn't matter." I guess I can live with that.

    --
    Bob Kline
    http://www.rksystems.com
     
    Bob Kline, Jan 20, 2011
    #7
  8. Martin Gregorie

    Martin Gregorie Guest

    On Thu, 20 Jan 2011 17:58:36 -0500, Bob Kline wrote:

    > Thanks. I'm not sure everyone would agree that it's OK to collapse
    > multiple consecutive spaces into one, but I'm beginning to suspect that
    > those more concerned with preserving as much as possible of the original
    > message are in the minority. It sounds like my take-home distillation
    > from this thread is "yes, the module ignores what the spec says about
    > unfolding, but it doesn't matter." I guess I can live with that.
    >

    I've been doing stuff in this area with the JavaMail package, though not
    as yet in Python. I've learnt that if you parse the headers you can
    extract values that work well for comparisons, as database keys, etc.,
    but they are not guaranteed to let you reconstitute the original header
    byte for byte. If preserving the message exactly as received matters, the
    solution is to parse the message to extract the headers and MIME parts
    you need for the application to carry out its function, but keep the
    original, unparsed message so you can pass it on.
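
A rough Python equivalent of that approach might look like the following. `StoredMessage`, `store` and `subject_key` are hypothetical names invented for this sketch; only `email.message_from_string()` and `Message.get()` come from the standard library.

```python
import email
from dataclasses import dataclass
from email.message import Message


@dataclass
class StoredMessage:
    """Hypothetical container: the untouched text plus a parsed view of it."""

    raw: str          # exactly what was received, for passing on unchanged
    parsed: Message   # parsed view, used only to extract values

    @property
    def subject_key(self) -> str:
        # Whitespace-normalized Subject, meant for comparisons or
        # database keys, not for reconstructing the original header.
        subject = str(self.parsed.get("Subject", ""))
        return " ".join(subject.split())


def store(raw: str) -> StoredMessage:
    return StoredMessage(raw=raw, parsed=email.message_from_string(raw))


if __name__ == "__main__":
    msg = store("Subject:\n blah\n\nbody\n")
    print(msg.subject_key)   # normalized value for comparisons
    print(repr(msg.raw))     # original text, byte for byte
```

The design choice is simply that nothing derived from the parser is ever written back over `raw`.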

    The other gotcha is assuming that the MUA author read and understood the
    RFCs. Very many barely glanced at the RFCs and/or misunderstood them.
    Consequently, if you use strict parsing you'll be surprised how many
    messages get rejected for having invalid headers or MIME headers. For
    instance, the mistakes some MUAs make when outputting To, CC and BCC
    headers with multiple addresses have to be seen to be believed. If the
    Python e-mail module lets you, set it to use lenient parsing. If this
    isn't an option you may well find yourself having to fix up messages
    before you can parse them successfully.
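
For what it's worth, here is a sketch of lenient versus strict parsing with the standard library, assuming a Python 3 version that ships `email.policy`. The sample message and the two helper functions are made up for illustration. By default the parser records problems on `msg.defects` instead of raising, while `policy.strict` turns defects into exceptions.

```python
import email
from email import errors, policy

# Made-up sample: the second line is neither a header nor a blank
# separator line, so the header block is malformed.
SAMPLE = "Subject: hi\nthis line is not a header\n\nbody\n"


def parse_lenient(text: str):
    """Parse with the default policy; problems are recorded, not raised."""
    return email.message_from_string(text)


def parse_strict(text: str):
    """Parse with policy.strict; return None if a defect is raised."""
    try:
        return email.message_from_string(text, policy=policy.strict)
    except errors.MessageDefect:
        return None


if __name__ == "__main__":
    msg = parse_lenient(SAMPLE)
    for defect in msg.defects:
        print(type(defect).__name__)
    print(parse_strict(SAMPLE) is None)
```

Inspecting `defects` after a lenient parse lets the application decide per message whether the damage matters.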


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
     
    Martin Gregorie, Jan 20, 2011
    #8
  9. Carl Banks

    Carl Banks Guest

    On Jan 20, 9:55 am, Bob Kline wrote:
    > On 1/20/2011 12:23 PM, Carl Banks wrote:
    >
    >
    >
    > > On Jan 20, 7:08 am, Bob Kline wrote:
    > >> I just noticed that the following passage in RFC 822:

    >
    > >>           The process of moving  from  this  folded   multiple-line
    > >>           representation  of a header field to its single line represen-
    > >>           tation is called "unfolding".  Unfolding  is  accomplished  by
    > >>           regarding   CRLF   immediately  followed  by  a  LWSP-char  as
    > >>           equivalent to the LWSP-char.

    >
    > >> is not being honored by the email module.  The following two invocations
    > >> of message_from_string() should return the same value, but that's not
    > >> what happens:

    >
    > >>   >>>  import email
    > >>   >>>  email.message_from_string("Subject: blah").get('SUBJECT')
    > >> 'blah'
    > >>   >>>  email.message_from_string("Subject:\n blah").get('SUBJECT')
    > >> ' blah'

    >
    > >> Note the space in front of the second value returned, but missing from
    > >> the first.  Can someone convince me that this is not a bug?

    > > That's correct, according to my reading of RFC 822 (I doubt it's
    > > changed so I didn't bother to look up what the latest RFC on that
    > > subject is.)

    >
    > > The RFC says that in a folded line the whitespace on the following
    > > line is considered a part of the line.

    >
    > Thanks for responding.  I think your interpretation of the RFC is the
    > same as mine.  What I'm saying is that by not returning the same value
    > in the two cases above the module is not "regarding CRLF immediately
    > followed by a LWSP-char as equivalent to the LWSP-char."


    That makes sense. The space after \n is part of the reconstructed
    subject and the email module should have treated it the same as if the
    line hadn't been folded. I agree that it's a bug. The line-folding
    needs to be moved earlier in the parse process.


    Carl Banks
     
    Carl Banks, Jan 21, 2011
    #9
