separating attribution, quoted text, and sigs from the body of a post

Discussion in 'Perl Misc' started by Art Merkel, Jan 17, 2007.

  1. Art Merkel

    Art Merkel Guest

    I wonder if anyone would be willing to share some code for pulling out
    the "meat" of the body of an e-mail or usenet post? I mean given the
    example

    =====begin example
    On 01/16/07 Fred Smith wrote:
    > blah blah


    Foo bar! Foo foo bar!

    > blah blah blah


    That's all I have to say

    --
    Here's my witty sig.
    =====end example

    just to return this:

    Foo bar! Foo foo bar!
    That's all I have to say


    I'm thinking of something involving while and the .. operator, but I'm
    not sure how to get rid of the "...wrote:"-type line without screwing
    up on posts that don't have one, or what pattern to use to catch the
    common ones.
     
    Art Merkel, Jan 17, 2007
    #1

  2. Art Merkel

    Guest

    Art Merkel wrote:
    > I wonder if anyone would be willing to share some code for pulling out
    > the "meat" of the body of an e-mail or usenet post?


    You won't be able to do this 100% of the time because the behavior of
    replies is different (and can be customized) in different newsreaders.
    Usenet posts are plain text, and lack the context tagging of XML, etc.
    But you can probably get pretty close to what you want.

    You can probably exclude 90%+ of attribution lines by excluding
    /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
    that assumes English-language newsgroups. Some folks try to be cute
    with attribution lines like:
    When Art Merkel finally sobered up, he blundered:
    Nuthin you can do about attribution lines like that, unless you
    hard-code distinctive strings for prolific posters.

    You can probably exclude 90%+ of context quotes by excluding /^>/.

    A usenet sig (if it's properly configured) follows a cutline which is
    two dashes and a space. It's easy to identify such a cutline and
    ignore everything which follows. But many posters don't use a proper
    cutline.
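
    A minimal sketch of the three filters described above, applied line by line. The name strip_noise is hypothetical, the input is assumed to be body lines that have already been chomped, and dropping blank lines is an extra assumption made to match the output requested in the first post. An attribution line that does not end in "wrote:" will slip through.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the body lines that survive the
# attribution, quote, and sig filters.
sub strip_noise {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        last if $line =~ /^-- $/;      # sig cutline: ignore everything after it
        next if $line =~ /^>/;         # quoted context
        next if $line =~ /wrote:$/;    # common English attribution line
        next if $line =~ /^\s*$/;      # blank line (assumption: not wanted)
        push @kept, $line;
    }
    return @kept;
}

# The example from the first post, one chomped line per element.
my @body = (
    'On 01/16/07 Fred Smith wrote:',
    '> blah blah',
    '',
    'Foo bar! Foo foo bar!',
    '',
    '> blah blah blah',
    '',
    q{That's all I have to say},
    '',
    '-- ',
    q{Here's my witty sig.},
);

print "$_\n" for strip_noise(@body);
```

    On that example this prints only the two lines of new text.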

    --
    The best way to get a good answer is to ask a good question.
    David Filmer (http://DavidFilmer.com)
     
    , Jan 17, 2007
    #2

  3. Art Merkel

    Art Merkel Guest

    Re: separating attribution, quoted text, and sigs from the body of a

    wrote:

    > You can probably exclude 90%+ of attribution lines by excluding
    > /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
    > that assumes English-language newsgroups. Some folks try to be cute
    > with attribution lines like:
    > When Art Merkel finally sobered up, he blundered:
    > Nuthin you can do about attribution lines like that, unless you
    > hard-code distinctive strings for prolific posters.


    How about storing lines that don't match /^>/ (some people's
    attribution lines wrap) until

    (1) I hit one that does match, in which case I discard what I've
    already got, or
    (2) I hit the sig cutline or the end of the message, in which case I
    keep everything I've already got, since it's probably an OP?

    Not sure what to do about top-posting (b*st*rds) though!


    > You can probably exclude 90%+ of context quotes by excluding /^>/.


    Of course.

    > A usenet sig (if it's properly configured) follows a cutline which is
    > two dashes and a space. It's easy to identify such a cutline and
    > ignore everything which follows. But many posters don't use a proper
    > cutline.


    Right: when I hit /^-- $/, stop there.
     
    Art Merkel, Jan 18, 2007
    #3
  4. Art Merkel

    Art Merkel Guest

    Re: separating attribution, quoted text, and sigs from the body of a

    wrote:

    > You won't be able to do this 100% of the time because the behavior of
    > replies is different (and can be customized) in different newsreaders.
    > Usenet posts are plain text, and lack the context tagging of XML, etc.
    > But you can probably get pretty close to what you want.
    >
    > You can probably exclude 90%+ of attribution lines by excluding
    > /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
    > that assumes English-language newsgroups. Some folks try to be cute
    > with attribution lines like:
    > When Art Merkel finally sobered up, he blundered:
    > Nuthin you can do about attribution lines like that, unless you
    > hard-code distinctive strings for prolific posters.
    >
    > You can probably exclude 90%+ of context quotes by excluding /^>/.


    I'm thinking of something "stateful" in which I scan lines until

    (1) I hit a line that starts with '>', in which case I discard
    everything I have so far (attribution). Then I keep going, ignoring
    /^>/ lines (quoted) but keeping other lines until I hit the cutline or
    the end.

    (2) I hit the cutline or the end, in which case I keep everything so
    far (an OP).
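
    The scan in (1) and (2) could be sketched roughly as below. The name extract_meat is hypothetical, the input is assumed to be chomped body lines, and blank lines are dropped as an extra assumption. Because everything before the first quoted line is discarded, a wrapped or "cute" attribution is handled without any pattern for it, but a top-posted reply would lose its new text.

```perl
use strict;
use warnings;

# Hypothetical helper: buffers unquoted lines, throws the buffer away
# at the first quoted line, and stops at the sig cutline.
sub extract_meat {
    my @lines = @_;
    my @kept;
    my $seen_quote = 0;

    for my $line (@lines) {
        last if $line =~ /^-- $/;            # cutline: the rest is sig
        if ($line =~ /^>/) {
            # First quoted line: whatever was buffered so far is
            # presumed to be attribution, so discard it.
            @kept = () unless $seen_quote;
            $seen_quote = 1;
            next;                            # quoted lines are never kept
        }
        push @kept, $line if $line =~ /\S/;  # skip blank lines (assumption)
    }
    return @kept;
}

# A reply with a wrapped, non-standard attribution.
my @reply = (
    'When Art Merkel finally sobered up,',
    'he blundered:',
    '> blah blah',
    '',
    'Foo bar! Foo foo bar!',
    '> blah blah blah',
    q{That's all I have to say},
    '-- ',
    q{Here's my witty sig.},
);

print "$_\n" for extract_meat(@reply);
```

    An original post with no quoted lines passes through unchanged apart from its sig and blank lines, which is case (2).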


    > A usenet sig (if it's properly configured) follows a cutline which is
    > two dashes and a space. It's easy to identify such a cutline and
    > ignore everything which follows. But many posters don't use a proper
    > cutline.


    No way to deal with top-posting, is there?
     
    Art Merkel, Jan 19, 2007
    #4
  5. Art Merkel

    Adam Funk Guest

    On 2007-01-17, wrote:

    > You won't be able to do this 100% of the time because the behavior of
    > replies is different (and can be customized) in different newsreaders.
    > Usenet posts are plain text, and lack the context tagging of XML, etc.
    > But you can probably get pretty close to what you want.
    >
    > You can probably exclude 90%+ of attribution lines by excluding
    > /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
    > that assumes English-language newsgroups. Some folks try to be cute
    > with attribution lines like:
    > When Art Merkel finally sobered up, he blundered:
    > Nuthin you can do about attribution lines like that, unless you
    > hard-code distinctive strings for prolific posters.


    Here's something I've tinkered with, which assumes that either the
    body is all original (no m/^>/ lines) or that all unquoted lines
    before the first quoted one are attribution lines (I think this is
    almost always the case for inline/bottom-posting).

    Comments, suggestions?

    Of course it doesn't handle top-posting!


    ##################################################
    #!/usr/bin/perl

    use strict;
    use warnings;
    use News::Article;

    # Label the body lines of each article file named on the command line.
    while (@ARGV) {
        my $filename = shift(@ARGV);
        my $in_art   = News::Article->new($filename);

        print("*****\n$filename\n");

        process_body($in_art->body());
    }


    sub process_body {
        my @input   = @_;
        my $op      = 1;
        my $not_sig = 1;
        my $line;

        # $op is true iff this is an original post (no quoted line
        # before the sig cutline)
        foreach $line (@input) {
            if ($line =~ /^>/) {
                $op = 0;
                last;
            }
            elsif ($line =~ /^-- $/) {
                last;
            }
        }

        if ($op) {
            print("original\n");
        }
        else {
            print("quoting\n");
        }

        # In a reply, treat every line before the first quoted line as
        # attribution. Stop before the quoted line itself so that the
        # next loop labels it as quoted.
        if (! $op) {
            while (@input && $input[0] !~ /^>/) {
                $line = shift(@input);
                print(" a $line\n");    # attribution
            }
        }

        # Label the remaining lines until the sig cutline is reached.
        while (@input && $not_sig) {
            $line = shift(@input);
            if ($line =~ /^-- $/) {
                $not_sig = 0;
                print(" - ");    # sig separator
            }
            elsif ($line !~ /^>/) {
                print(" n ");    # new content
            }
            else {
                print(" q ");    # quoted
            }
            print($line, "\n");
        }

        # Everything after the cutline is the sig.
        while (@input) {
            $line = shift(@input);
            print(" s $line\n");    # sig
        }
    }
     
    Adam Funk, Feb 6, 2007
    #5