Recursively parsing through multipart messages using Mail::Box::Manager

Discussion in 'Perl Misc' started by Bloch, Dec 21, 2005.

  1. Bloch

    Bloch Guest

    I've written a little script that uses Mail::Box::Manager to parse an mbox
    file, strip off most of the headers, decode the body, and eventually print
    the data that is encoded as text/plain. It works fine for messages that
    are flat (i.e., multipart/alternative on the top level) and it can just
    grab the plaintext attachments from 1 level down.

    I run into problems when I hit multipart/mixed messages and I have to
    descend a level. I've been reading through the groups.google.com
    archives and the man pages of these modules and see that applying
    these items recursively is tricky for inexperienced programmers -- which
    I claim to be. Can someone recommend a better way of getting to my desired
    endpoint, or help me sort out how to get there using my existing approach?

    I've attached the relevant portion of my code and the output of
    printStructure to give a better idea of the problem domain.

    #!/usr/local/bin/perl

    use Mail::Box::Manager;
    use Date::Parse;
    use warnings;
    use strict;

    my $mgr = Mail::Box::Manager->new;
    #my $folder_file = "/home/salvador/mail/releases";
    my $folder_file = "/home/salvador/mail/releases.old";
    my $folder = $mgr->open(folder => $folder_file)
        or die "Could not open folder $!\n";
    my (@subject, @sender, @body, @time);
    my $x = 0;
    for ($folder->messages) {
        $subject[$x] = $_->subject;
        $sender[$x]  = $_->sender->address;
        $time[$x]    = $_->get('Date');
        #body[$x] = $decode = $_->decoded;
        #$_->printStructure;

        if ($_->isMultipart) {
            # Only the top-level parts are visited here.
            foreach my $part ($_->body->parts) {
                my $attached_head = $part->head;
                my $attached_body = $part->decoded;
                if ($attached_head =~ /text\/plain/) {
                    # print "$attached_head\n";
                    # print "OK\n";
                } elsif ($attached_head =~ /multipart\/alternative/i) {
                    print "$attached_head\n";
                    print "Crap! How do I parse the next batch of headers?\n";
                    print "$attached_body";
                }
            }
        }
        $x++;
    }

    PARTIAL OUTPUT OF MESSAGE STRUCTURES:

    OK:
    multipart/alternative: KENNEDY: AMERICANS DESERVE BETTER THAN A REPUBLICAN BUDGET THAT LEAVES THEM BEHIND (111850 bytes)
        text/plain (47689 bytes)
        text/html (62436 bytes)

    OK:
    multipart/alternative: Boxer Asks Legal Scholars on Dean's 'Impeachable Offense' Comment (10116 bytes)
        text/plain (2647 bytes)
        text/html (5495 bytes)

    OK:
    multipart/alternative: Sen. Jeffords' Statement on ANWR/Defense Spending Bill (8876 bytes)
        text/plain (1030 bytes)
        text/html (5864 bytes)

    FAILS TO PARSE PROPERLY:
    multipart/mixed: KENNEDY: REPUBLICANS BLOCK INTELLIGENCE BILL TO AVOID THE TRUTH OF THE WAR (202224 bytes)
        multipart/alternative (146945 bytes)
            text/plain (54877 bytes)
            text/html (91778 bytes)
        application/msexcel (53598 bytes)

    ....
     
    Bloch, Dec 21, 2005
    #1

  2. Bloch

    Bloch Guest

    GEEEEEEYYYYYAAAARGH!!!

    foreach my $part($_->body->parts('RECURSE'))

    was the option that I was looking for. Missed it in the documentation
    (several times, I might add).
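
    For anyone who lands here with the same problem, here is a minimal sketch
    of that option in use. It assumes Mail::Message (from the same
    distribution) is installed. The helper name plain_text_parts and the
    sample message are made up for illustration and are not part of my
    original script; the sample just mimics the multipart/mixed shape that
    failed above.

```perl
use strict;
use warnings;
use Mail::Message;

# Hypothetical helper (not from the original script): returns the decoded
# text of every text/plain leaf part, however deeply it is nested.
sub plain_text_parts {
    my ($msg) = @_;
    my @texts;

    # parts('RECURSE') expands nested multiparts into their leaf parts,
    # so no manual descent is needed.
    for my $part ($msg->parts('RECURSE')) {
        next unless lc($part->contentType) eq 'text/plain';
        push @texts, $part->decoded->string;
    }
    return @texts;
}

# Made-up sample with the same shape as the failing message:
# multipart/mixed -> multipart/alternative -> text/plain + text/html,
# followed by a binary attachment.
my $raw = <<'EOM';
From: sender@example.com
To: reader@example.com
Subject: Nested example
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain; charset="us-ascii"

Hello from the plain part.
--inner
Content-Type: text/html; charset="us-ascii"

<p>Hello from the HTML part.</p>
--inner--
--outer
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64

AAEC
--outer--
EOM

my $msg = Mail::Message->read($raw);
print "$_\n" for plain_text_parts($msg);
```

    In the real script the same loop body would sit inside the
    for ($folder->messages) loop, in place of the top-level-only
    $_->body->parts call.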

    For what it's worth, I place the blame entirely on Mark Overmeer, who
    spent godknowshowlong writing and documenting this excellent module.
    Mark, if you hadn't been so thorough, I would never have missed such an
    important, easily-spotted detail. No, no, this has nothing to do with the
    fact that I'm an American, weaned on television and raised in the age of
    instant gratification. Nor with the fact that my IQ is roughly 200 points
    lower than a sponge -- and not one of those real sponges either, I'm
    talking a sponge made by 3M or Dow or someone. No, it's your fault.

    And that goes for the lot of you Perl mongers who have contributed to
    developing Perl, and in so doing, have helped to build the modern
    internet, or rather, "internets" as our President so eloquently puts it.
    You owe me something. I could be using Smalltalk, or Eiffel, or Scheme or
    Visual Basic or something, but I chose Perl. Okay, admittedly, Perl
    *might* be slightly better than those languages for the problem domains
    that I usually look at -- parsing textfiles and playing around with *nixy
    stuff and so on. But many of my former CS professors *insist* that it's
    ugly -- so it must be true -- so, again, you owe me for giving me such a
    cool language to play with for free -- as in free beer and free
    speech.

    ;-)

    Bloch, Dec 21, 2005
    #2