Decoding 'funky' e-mail subjects

Jonas Galvez

Hi, I need a function to parse badly encoded 'Subject' headers from
e-mails, such as the following:

=?ISO-8859-1?Q?Murilo_Corr=EAa?=
=?ISO-8859-1?Q?Marcos_Mendon=E7a?=

I tried using the decode() function from mimetools, but that doesn't
appear to be the correct solution. I ended up coding the following:

import re

subject = "=?ISO-8859-1?Q?Murilo_Corr=EAa?="
subject = re.search("(?:=\?[^\?]*\?\Q\?)?(.*)\?=", subject)
subject = subject.group(1)

def decodeEntity(str):
    str = str.group(1)
    try: return eval('"\\x%s"' % str)
    except: return "?"

subject = re.sub("=([^=].)", decodeEntity, subject)
print subject.replace("_", " ").decode("iso-8859-1")

Can anyone recommend a safer method?

Tia,



\\ jonas galvez
// jonasgalvez.com
 
Oliver Kurz

Have you tried decode_header from email.Header in the python email-package?
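For readers on a current Python, here is a minimal sketch of what that suggestion looks like. It assumes Python 3, where the module is spelled email.header, and the helper name decode_subject is made up for this illustration. An unknown charset name in a header would raise LookupError here, which a real application would want to handle.

```python
from email.header import decode_header

def decode_subject(raw):
    """Decode every encoded word in a Subject header and join the pieces."""
    parts = []
    for payload, charset in decode_header(raw):
        # decode_header yields bytes for decoded words and may yield str
        # for headers that contain no encoded words at all.
        if isinstance(payload, bytes):
            payload = payload.decode(charset or "ascii", "replace")
        parts.append(payload)
    return "".join(parts)

print(decode_subject("=?ISO-8859-1?Q?Murilo_Corr=EAa?="))
print(decode_subject("=?ISO-8859-1?Q?Marcos_Mendon=E7a?="))
```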



Best regards,

Oliver
 
Jonas Galvez

Oliver said:
Have you tried decode_header from email.Header
in the python email-package?

Thanks, that works. The problem is that I need to make it compatible
with Python 1.5.2. I improved my regex-based method and it has worked
fine with all my test cases so far. But if anyone has any other
suggestion, I'm still interested. Anyway, here's my code:

import re
from string import *

def decodeHeader(h):
    def firstGroup(s):
        if s.group(1): return s.group(1)
        return s.group()
    h = re.compile("=\?[^\?]*\?q\?", re.I).sub("", h)
    h = re.compile(
        "=\?(?:(?:(?:(?:(?:(?:(?:(?:w)?i)?n)?d)?o)?w)?s)?|"
        "(?:(?:(?:i)?s)?o)?|(?:(?:(?:u)?t)?f)?)"
        "[^\.]*?(\.\.\.)?$",
        re.I).sub(firstGroup, h)
    h = re.sub("=.(\.\.\.)?$", firstGroup, h)
    def isoEntities(str):
        str = str.group(1)
        try: return eval('"\\x%s"' % str)
        except: return "?"
    h = re.sub("=([^=].)", isoEntities, h)
    if h[-2:] == "?=": h = h[:-2]
    return replace(h, "_", " ")

print decodeHeader("=?ISO-8859-1?Q?Marcos_Mendon=E7a?=")
print decodeHeader("=?ISO-8859-1?Q?Test?=")
print decodeHeader("=?UTF-8?Q?Test?=")
print decodeHeader("Test =?windows-125...")
print decodeHeader("Test =?window-125...")
print decodeHeader("Test =?windo-1...")
print decodeHeader("Test =?wind...")
print decodeHeader("Test =?...")
print decodeHeader("Test =?w...")
print decodeHeader("Test =?iso...")




\\ jonas galvez
// jonasgalvez.com
 
Skip Montanaro

Jonas> Thanks, that works. The problem is that I need to make it
Jonas> compatible with Python 1.5.2.

Why not just include email.Header.decode_header() in your app? Something
like:

try:
    from email.Header import decode_header
except ImportError:
    # Python 1.5.2 compatibility...
    def decode_header(...):
        ...

If that proves to be intractable, define yours when an ImportError is
raised. In either case, you get the best solution when you can and only
fall back to something possibly suboptimal when necessary.
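A runnable sketch of this import-with-fallback pattern follows. It assumes the lowercase module name email.header used by newer Python versions, then tries the email.Header spelling from this thread, and finally defines a stand-in. The stand-in is purely hypothetical: it returns the header undecoded and only marks the place where a hand-written decoder would go.

```python
try:
    # Spelling used by newer Python versions.
    from email.header import decode_header
except ImportError:
    try:
        # Spelling used in this thread.
        from email.Header import decode_header
    except ImportError:
        def decode_header(header):
            """Hypothetical last-resort stand-in: return the header as-is."""
            return [(header, None)]

# Whichever definition was picked up, callers use the same name.
for text, charset in decode_header("=?UTF-8?Q?Test?="):
    print(text, charset)
```

The calling code never needs to know which branch supplied decode_header, as long as the fallback returns the same list of (text, charset) pairs.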

Skip
 
Paul Rubin

Jonas Galvez said:
Thanks, that works. The problem is that I need to make it compatible
with Python 1.5.2. I improved my regex-based method and it has worked
fine with all my test cases so far. But if anyone has any other
suggestion, I'm still interested. Anyway, here's my code:

A lot of those funny subjects come from spammers. Never eval anything
from anyone like that!!!
 
Christos TZOTZIOY Georgiou

Paul said:
A lot of those funny subjects come from spammers. Never eval anything
from anyone like that!!!

(The part of the code that caused Paul's comment):

try: return eval('"\\x%s"' % str)
except: return "?"

Sound advice from Paul. However, lots of those funny subjects come in
legitimate e-mails from countries where the ASCII range is not enough.

So, a safer alternative to the code above is:

try: return chr(string.atoi(str, 16))
except: return '?'
# int(s, base) was not available in 1.5.2
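As a fuller illustration of the eval-free approach, here is a small sketch for a current Python 3, where int(s, 16) is available. The function names are invented for this example. Note that the callback given to re.sub must return a string, so the parsed number is passed through chr(); this relies on ISO-8859-1 byte values coinciding with the first 256 Unicode code points and would not be right for other charsets.

```python
import re
import string

def hex_escape_to_char(match):
    """Turn the two characters after '=' into one character, without eval."""
    digits = match.group(1)
    if all(c in string.hexdigits for c in digits):
        return chr(int(digits, 16))
    # Anything that is not two hex digits is replaced, as in the thread.
    return "?"

def decode_q_escapes(text):
    # Same pattern as in the original code: '=' followed by two characters.
    return re.sub(r"=([^=].)", hex_escape_to_char, text)

print(decode_q_escapes("Murilo_Corr=EAa"))
```

Because the input is only ever checked and parsed as a number, nothing a sender puts in the header is executed.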
 
Jonas Galvez

Paul said:
A lot of those funny subjects come from spammers. Never eval
anything from anyone like that!!!

Hi Paul, yeah, actually, that kind of 'funky' subject is very common
on mailing-lists here in Brazil (where ISO-8859-1 is the standard). A
lot of people use crappy webmail software which spills out that kind
of mess. So I'm forced to deal with it :)

By the way, this is for a mail2rss application which will enable easy
removal/blacklisting of spam, among other things.

Christos said:
Sound advice from Paul. However, lots of those funny subjects come
in legitimate e-mails from countries where the ASCII range is not
enough. So, a safer alternative to the code above is:

try: return chr(string.atoi(str, 16))
except: return '?'
# int(s, base) was not available in 1.5.2

Thanks! Yeah, I tried using int(str, base) on Python 1.5.2, and I was
too lazy to look for the alternative when I was able to do that quick
and dirty eval() thingy :)



\\ jonas galvez
// jonasgalvez.com
 
Michel Claveau/Hamster

Hi!

You have to decode each block of the subject delimited by =? ... ?=
and then join the results.
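The block-by-block approach could be sketched roughly as follows, assuming a current Python 3 and well-formed encoded words (truncated ones are left untouched). The names ENCODED_WORD, decode_word and decode_blocks are made up for this example, and an unknown charset name would raise LookupError.

```python
import base64
import quopri
import re

# One encoded word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([QqBb])\?([^?]*)\?=")

def decode_word(match):
    """Decode a single =?...?= block to text."""
    charset, encoding, payload = match.groups()
    data = payload.encode("ascii", "replace")
    if encoding.upper() == "Q":
        # header=True also turns '_' into a space.
        raw = quopri.decodestring(data, header=True)
    else:
        raw = base64.b64decode(data)
    return raw.decode(charset, "replace")

def decode_blocks(subject):
    # Whitespace between two adjacent encoded words is not part of the text.
    subject = re.sub(r"(\?=)\s+(=\?)", r"\1\2", subject)
    # Decode each block in place; re.sub joins the results.
    return ENCODED_WORD.sub(decode_word, subject)

print(decode_blocks("Test =?ISO-8859-1?Q?Corr=EAa?= ok"))
```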



*sorry for my poor english*



@-salutations
 
