Outlook Crapolla: The Fakie Breakie

afrinspray

I'm writing an email parsing engine with MIME::Tools. This utility rocks, by
the way... the whole package makes MIME processing very easy.

My problem, however, is with Outlook emails and their horrible styling.
While normal people will use the <br> tag for line breaks, Outlook
likes to do stuff like this:

<DIV dir=ltr align=left><FONT size=2><SPAN
class=3D671020819-14062006></SPAN></FONT>&nbsp;</DIV>

They like to use these weird CSS classes as well, like
3D671020819-14062006 (which isn't defined anywhere in the document) and
MsoNormal. Also, they like to use random garbage pseudo-breaks here
and there that don't show up in Outlook, but show up in every other
HTML parser I've seen... so I'm using the HTML::Tree class to remove
empty breaks. Uggg... it's just a total mess.

Is there a reliable Perl module for converting Outlook garbage into
real HTML?

Thanks,
Mike
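
A side note on the class name quoted above: the leading "3D" looks like quoted-printable residue rather than part of the name, since "=3D" is how that encoding writes a literal "=". This only matters if the text being cleaned is still in its transfer-encoded form. A minimal sketch, assuming the fragment is still raw quoted-printable and using the core MIME::QuotedPrint module (the sample string is trimmed from the post):

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Raw, still-encoded fragment (shortened from the example in the post).
my $raw = '<SPAN class=3D671020819-14062006></SPAN>';

# decode_qp turns each "=XX" hex escape back into the byte it stands for,
# so "=3D" becomes a plain "=".
my $decoded = decode_qp($raw);

print "$decoded\n";   # <SPAN class=671020819-14062006></SPAN>
```

After decoding, the class is just 671020819-14062006, which is still not defined anywhere in the document.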
 
Dr.Ruud

afrinspray wrote:
<DIV dir=ltr align=left><FONT size=2><SPAN
class=3D671020819-14062006></SPAN></FONT>&nbsp;</DIV>

If the whole message is in $_, then you could do

s~ < (SPAN) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~ < (FONT) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~ < (DIV) (?:\s[^>]*)? > \s* &nbsp; \s* < / \1 > ~~xg ;


Test:

echo '
a<DIV dir=ltr align=left><FONT size=2><SPAN
class=3D671020819-14062006></SPAN></FONT>&nbsp;</DIV>b
' | perl -we '

undef $/ ;
$_ = <> ;

s~< (SPAN) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~< (FONT) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~< (DIV) (?:\s[^>]*)? > \s* &nbsp; \s* < / \1 > ~<br>~xg ;
print
'

Prints: a<br>b
 
afrinspray

Dr.Ruud said:
s~ < (SPAN) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~ < (FONT) (?:\s[^>]*)? > \s* < / \1 > ~~xg ;
s~ < (DIV) (?:\s[^>]*)? > \s* &nbsp; \s* < / \1 > ~~xg ;

Great idea, but I'm not sure these tags will always be in the same
order (i.e. <div><font><span>). I used the HTML::TreeBuilder class
to remove empty tags, which is a solid alternative to what you wrote
above. But I still can't remove these tags outright, because they
might be important depending on the spacing properties of these mystery
styles. I did some research this morning on the Microsoft CSS class
MsoNormal, which, if the newsgroups are correct, is supposed to make a
newline meaningless. So I guess the real trick is to remove all
MsoNormal tags, then do what you did above with the &nbsp; -> <br>.

Thanks!

Mike
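
On the ordering worry: one way to make the regex approach independent of nesting order is to repeat the empty-tag removal until nothing changes. This is only a sketch, not a replacement for a real HTML parser. The helper name strip_outlook_spacers is made up for illustration, matching is case-insensitive, and it assumes no attribute value contains a ">" character.

```perl
use strict;
use warnings;

# Hypothetical helper name; not part of any module.
sub strip_outlook_spacers {
    my ($html) = @_;

    # Remove empty SPAN/FONT pairs repeatedly, so nesting in either order
    # (<font><span></span></font> or <span><font></font></span>) collapses.
    1 while $html =~ s{<(span|font)\b[^>]*>\s*</\1\s*>}{}gi;

    # A DIV holding nothing but &nbsp; is treated as a line break.
    $html =~ s{<div\b[^>]*>\s*&nbsp;\s*</div\s*>}{<br>}gi;

    return $html;
}

# Same shape as the Outlook fragment, but with SPAN and FONT swapped.
my $in = qq{a<DIV dir=ltr align=left><SPAN\nclass=x><FONT size=2></FONT></SPAN>&nbsp;</DIV>b};
print strip_outlook_spacers($in), "\n";   # a<br>b
```

The loop stops as soon as a pass removes nothing, so a SPAN or FONT that still has content is left alone.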
 
afrinspray

For reference, for anyone viewing this thread in the future...

MsoNormal from the "Save as Web Page" feature in Word 2003:

<STYLE>
P.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman";
mso-style-parent: ""; mso-pagination: widow-orphan;
mso-fareast-font-family: "Times New Roman"
}
LI.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman";
mso-style-parent: ""; mso-pagination: widow-orphan;
mso-fareast-font-family: "Times New Roman"
}
DIV.MsoNormal {
FONT-SIZE: 12pt; MARGIN: 0in 0in 0pt; FONT-FAMILY: "Times New Roman";
mso-style-parent: ""; mso-pagination: widow-orphan;
mso-fareast-font-family: "Times New Roman"
}
</STYLE>
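
Going by the MARGIN rule above, one option when the <STYLE> block gets dropped is to swap class=MsoNormal for an inline margin reset, so the paragraphs keep their zero spacing. A rough sketch under those assumptions: the helper name inline_mso_normal is hypothetical, the font rules are deliberately dropped, and a tag that already carries its own style attribute would end up with two.

```perl
use strict;
use warnings;

# Hypothetical helper: replace class=MsoNormal on P/LI/DIV tags with an
# inline margin reset that approximates "MARGIN: 0in 0in 0pt".
sub inline_mso_normal {
    my ($html) = @_;
    $html =~ s{(<(?:p|li|div)\b[^>]*?)\s+class\s*=\s*(["']?)MsoNormal\2(?=[\s>])}
              {$1 style="margin:0"}gi;
    return $html;
}

print inline_mso_normal('<P class=MsoNormal>Hello</P>'), "\n";
# <P style="margin:0">Hello</P>
```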
 
