Decoding 'funky' e-mail subjects

Discussion in 'Python' started by Jonas Galvez, Jun 7, 2004.

  1. Jonas Galvez

    Jonas Galvez Guest

    Hi, I need a function to parse badly encoded 'Subject' headers from
    e-mails, such as the following:

    =?ISO-8859-1?Q?Murilo_Corr=EAa?=
    =?ISO-8859-1?Q?Marcos_Mendon=E7a?=

    I tried using the decode() method from mimetools, but that doesn't
    appear to be the correct solution. I ended up coding the following:

    import re

    subject = "=?ISO-8859-1?Q?Murilo_Corr=EAa?="
    subject = re.search("(?:=\?[^\?]*\?\Q\?)?(.*)\?=", subject)
    subject = subject.group(1)

    def decodeEntity(str):
        str = str.group(1)
        try: return eval('"\\x%s"' % str)
        except: return "?"

    subject = re.sub("=([^=].)", decodeEntity, subject)
    print subject.replace("_", " ").decode("iso-8859-1")

    Can anyone recommend a safer method?

    Tia,



    \\ jonas galvez
    // jonasgalvez.com
     
    Jonas Galvez, Jun 7, 2004
    #1

  2. Jonas Galvez

    Oliver Kurz Guest

    Have you tried decode_header from email.Header in the python email-package?
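
For readers on a current interpreter, the suggestion above can be sketched as follows. This assumes Python 3, where the module is spelled `email.header` (the thread's `email.Header` is the Python 2 name). The helper name `decode_subject` is made up for illustration.

```python
from email.header import decode_header, make_header

def decode_subject(raw):
    """Decode an RFC 2047 encoded header value into a str.

    decode_header() splits the value into (bytes_or_str, charset) pairs;
    make_header() reassembles them into a Header object, and str() renders
    that object as Unicode text.
    """
    return str(make_header(decode_header(raw)))

print(decode_subject("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))  # prints: Murilo Corrêa
```

Underscores in Q-encoded words are turned into spaces by the library, so no manual `replace("_", " ")` step is needed here.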



    Best regards,

    Oliver
     
    Oliver Kurz, Jun 7, 2004
    #2

  3. Jonas Galvez

    Jonas Galvez Guest

    Oliver Kurz wrote:
    > Have you tried decode_header from email.Header
    > in the python email-package?


    Thanks, that works. The problem is that I need to make it compatible
    with Python 1.5.2. I improved my regex-based method and it has worked
    fine with all my test cases so far. But if anyone has any other
    suggestion, I'm still interested. Anyway, here's my code:

    import re
    from string import *

    def decodeHeader(h):
        def firstGroup(s):
            if s.group(1): return s.group(1)
            return s.group()
        h = re.compile("=\?[^\?]*\?q\?", re.I).sub("", h)
        h = re.compile(
            "=\?(?:(?:(?:(?:(?:(?:(?:(?:w)?i)?n)?d)?o)?w)?s)?|"
            "(?:(?:(?:i)?s)?o)?|(?:(?:(?:u)?t)?f)?)"
            "[^\.]*?(\.\.\.)?$",
            re.I).sub(firstGroup, h)
        h = re.sub("=.(\.\.\.)?$", firstGroup, h)
        def isoEntities(str):
            str = str.group(1)
            try: return eval('"\\x%s"' % str)
            except: return "?"
        h = re.sub("=([^=].)", isoEntities, h)
        if h[-2:] == "?=": h = h[:-2]
        return replace(h, "_", " ")

    print decodeHeader("=?ISO-8859-1?Q?Marcos_Mendon=E7a?=")
    print decodeHeader("=?ISO-8859-1?Q?Test?=")
    print decodeHeader("=?UTF-8?Q?Test?=")
    print decodeHeader("Test =?windows-125...")
    print decodeHeader("Test =?window-125...")
    print decodeHeader("Test =?windo-1...")
    print decodeHeader("Test =?wind...")
    print decodeHeader("Test =?...")
    print decodeHeader("Test =?w...")
    print decodeHeader("Test =?iso...")




    \\ jonas galvez
    // jonasgalvez.com
     
    Jonas Galvez, Jun 7, 2004
    #3

  4. >> Have you tried decode_header from email.Header in the python
    >> email-package?


    Jonas> Thanks, that works. The problem is that I need to make it
    Jonas> compatible with Python 1.5.2.

    Why not just include email.Header.decode_header() in your app? Something
    like:

    try:
        from email.Header import decode_header
    except ImportError:
        # Python 1.5.2 compatibility...
        def decode_header(...):
            ...

    If that proves to be intractable, define yours when an ImportError is
    raised. In either case, you get the best solution when you can and only
    fall back to something possibly suboptimal when necessary.
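
A minimal sketch of that "prefer the library, fall back when it is missing" pattern, written for Python 3 as an assumption (the module there is `email.header`). The names `pick_decoder` and `fallback_decode_header` are hypothetical, and the fallback is only a stand-in that returns the value undecoded; a real fallback would hold the hand-written decoder.

```python
import importlib

def fallback_decode_header(value):
    """Hypothetical stand-in: return the value undecoded, in the same
    list-of-(text, charset) shape that decode_header() uses."""
    return [(value, None)]

def pick_decoder(module_name="email.header"):
    """Return the library's decode_header if the module can be imported,
    otherwise the local fallback."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return fallback_decode_header
    return module.decode_header

decode_header = pick_decoder()
print(decode_header("=?ISO-8859-1?Q?Test?="))
```

Taking the module name as a parameter is only there so the fallback path can be exercised in a test; the plain `try`/`except ImportError` around the import does the same job in an application.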

    Skip
     
    Skip Montanaro, Jun 7, 2004
    #4
  5. Jonas Galvez

    Paul Rubin Guest

    "Jonas Galvez" <> writes:
    > Thanks, that works. The problem is that I need to make it compatible
    > with Python 1.5.2. I improved my regex-based method and it has worked
    > fine with all my test cases so far. But if anyone has any other
    > suggestion, I'm still interested. Anyway, here's my code:


    A lot of those funny subjects come from spammers. Never eval anything
    from anyone like that!!!
     
    Paul Rubin, Jun 8, 2004
    #5
  6. On 07 Jun 2004 17:20:02 -0700, rumours say that Paul Rubin
    <http://> might have written:

    >"Jonas Galvez" <> writes:
    >> Thanks, that works. The problem is that I need to make it compatible
    >> with Python 1.5.2. I improved my regex-based method and it has worked
    >> fine with all my test cases so far. But if anyone has any other
    >> suggestion, I'm still interested. Anyway, here's my code:


    >A lot of those funny subjects come from spammers. Never eval anything
    >from anyone like that!!!


    (The part of the code that caused Paul's comment):

    try: return eval('"\\x%s"' % str)
    except: return "?"

    Sound advice from Paul. However, lots of those funny subjects come in
    legitimate e-mails from countries where the ASCII range is not enough.

    So, a safer alternative to the code above is:

    try: return chr(string.atoi(str, 16))
    except: return '?'
    # int(s, base) was not available in 1.5.2
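
The same idea in Python 3 terms, as a sketch rather than a drop-in replacement: each `=XX` escape is converted with `int(..., 16)` instead of `eval`, so nothing from the message is ever executed. The function name `decode_q_escapes` and the default charset are assumptions made for this example.

```python
import re

def decode_q_escapes(text, charset="iso-8859-1"):
    """Decode the payload of a Q-encoded word without using eval()."""
    def hex_to_byte(match):
        # Two hex digits become a single byte; no code is evaluated.
        return bytes([int(match.group(1), 16)])

    # Underscores stand for spaces in Q encoding. Replace them before the
    # escapes are decoded, so an encoded =5F stays a literal underscore.
    raw = text.replace("_", " ").encode("ascii", "replace")
    # Only well-formed =XX escapes are touched; anything else is left as is.
    raw = re.sub(rb"=([0-9A-Fa-f]{2})", hex_to_byte, raw)
    return raw.decode(charset, "replace")

print(decode_q_escapes("Marcos_Mendon=E7a"))  # prints: Marcos Mendonça
```

Because the pattern accepts only two hex digits, a hostile payload such as `=__import__` is simply passed through as text.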
    --
    TZOTZIOY, I speak England very best,
    "I have a cunning plan, m'lord" --Sean Bean as Odysseus/Ulysses
     
    Christos TZOTZIOY Georgiou, Jun 8, 2004
    #6
  7. Jonas Galvez

    Jonas Galvez Guest

    Paul Rubin wrote:
    > A lot of those funny subjects come from spammers. Never eval
    > anything from anyone like that!!!


    Hi Paul, yeah, actually, that kind of 'funky' subject is very common
    on mailing-lists here in Brazil (where ISO-8859-1 is the standard). A
    lot of people use crappy webmail software which spills out that kind
    of mess. So I'm forced to deal with it :)

    By the way, this is for a mail2rss application which will enable easy
    removal/blacklisting of spam, among other things.

    Christos TZOTZIOY Georgiou wrote:
    > Sound advice from Paul. However, lots of those funny subjects come
    > in legitimate e-mails from countries where the ASCII range is not
    > enough. So, a safer alternative to the code above is:
    >
    > try: return chr(string.atoi(str, 16))
    > except: return '?'
    > # int(s, base) was not available in 1.5.2


    Thanks! Yeah, I tried int(str, base), which does not exist on Python
    1.5.2, and I was too lazy to look for the alternative when the quick
    and dirty eval() trick worked :)



    \\ jonas galvez
    // jonasgalvez.com
     
    Jonas Galvez, Jun 8, 2004
    #7
  8. Hi!

    Decode each portion of the subject delimited by =? ... ?= separately,
    then join the results.
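
The block-by-block approach can be sketched like this, assuming Python 3. The names `ENCODED_WORD`, `decode_word` and `decode_subject_blocks` are invented for the example, and blocks that cannot be decoded are left untouched rather than guessed at.

```python
import base64
import quopri
import re

# One encoded block: =?charset?Q-or-B?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")

def decode_word(match):
    """Decode a single =?...?= block; return it unchanged on failure."""
    charset, encoding, payload = match.groups()
    data = payload.encode("ascii", "replace")
    try:
        if encoding.upper() == "B":
            raw = base64.b64decode(data)
        else:
            # header=True makes quopri treat "_" as a space.
            raw = quopri.decodestring(data, header=True)
        return raw.decode(charset, "replace")
    except (ValueError, LookupError):
        # Bad base64 data or an unknown charset: keep the original text.
        return match.group(0)

def decode_subject_blocks(text):
    """Decode every encoded block in the subject and join the results."""
    # Whitespace that only separates two adjacent blocks is not content.
    text = re.sub(r"(?<=\?=)\s+(?==\?)", "", text)
    return ENCODED_WORD.sub(decode_word, text)

print(decode_subject_blocks("Re: =?ISO-8859-1?Q?Murilo_Corr=EAa?="))
```

Plain text around the blocks is passed through as is, so a subject that mixes ordinary words with encoded ones keeps its spacing.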



    *sorry for my poor english*



    @-salutations
    --
    Michel Claveau
    mél : http://cerbermail.com/?6J1TthIa8B
     
    Michel Claveau/Hamster, Jun 8, 2004
    #8