separating attribution, quoted text, and sigs from the body of a post

Art Merkel

I wonder if anyone would be willing to share some code for pulling out
the "meat" of the body of an e-mail or usenet post? I mean given the
example

=====begin example
blah blah

Foo bar! Foo foo bar!
blah blah blah

That's all I have to say

--
Here's my witty sig.
=====end example

just to return this:

Foo bar! Foo foo bar!
That's all I have to say


I'm thinking of something involving while and the .. operator, but I'm
not sure how to get rid of the "...wrote:"-type line without screwing
up on posts that don't have one, or what pattern to use to catch the
common ones.
 
usenet

Art said:
> I wonder if anyone would be willing to share some code for pulling out
> the "meat" of the body of an e-mail or usenet post?

You won't be able to do this 100% of the time because the behavior of
replies is different (and can be customized) in different newsreaders.
Usenet posts are plain text, and lack the context tagging of XML, etc.
But you can probably get pretty close to what you want.

You can probably exclude 90%+ of attribution lines by excluding
/wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
that assumes English-language newsgroups. Some folks try to be cute
with attribution lines like:
When Art Merkel finally sobered up, he blundered:
Nuthin you can do about attribution lines like that, unless you
hard-code distinctive strings for prolific posters.

You can probably exclude 90%+ of context quotes by excluding /^>/.

A usenet sig (if it's properly configured) follows a cutline which is
two dashes and a space. It's easy to identify such a cutline and
ignore everything which follows. But many posters don't use a proper
cutline.
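
Those three rules can be sketched in a few lines of Perl. This is only a rough illustration under some assumptions: strip_post is a made-up name, the lines are assumed to be already chomped, and the sample post is invented. It will miss "cute" or non-English attribution lines and sigs without a proper cutline, as noted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only the lines that look like new content.
sub strip_post {
    my @lines = @_;
    my @kept;
    for my $line (@lines) {
        last if $line =~ /^-- $/;     # sig cutline: ignore everything after it
        next if $line =~ /^>/;        # quoted context
        next if $line =~ /wrote:$/;   # common English attribution line
        push @kept, $line;
    }
    return @kept;
}

# Invented sample post, already split into chomped lines
my @post = (
    'Art Merkel wrote:',
    '> I wonder if anyone would be willing to share some code',
    'Foo bar! Foo foo bar!',
    '> blah blah blah',
    "That's all I have to say",
    '-- ',
    "Here's my witty sig.",
);

print "$_\n" for strip_post(@post);
```

With the sample above, only the two unquoted body lines are printed.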
 
Art Merkel

> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.

How about storing lines (some people's attribution lines wrap) that
don't match /^>/ until

(1) I hit one that does match, in which case I discard what I've
already got
or
(2) I hit the sig cutline or the end of the message, in which case I
keep everything I've already got, since it's probably an OP?

Not sure what to do about top-posting (b*st*rds) though!

> You can probably exclude 90%+ of context quotes by excluding /^>/.

Of course.

> A usenet sig (if it's properly configured) follows a cutline which is
> two dashes and a space. It's easy to identify such a cutline and
> ignore everything which follows. But many posters don't use a proper
> cutline.

Right --- when I hit /^-- $/ , stop there.
 
Art Merkel

> You won't be able to do this 100% of the time because the behavior of
> replies is different (and can be customized) in different newsreaders.
> Usenet posts are plain text, and lack the context tagging of XML, etc.
> But you can probably get pretty close to what you want.

> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.

> You can probably exclude 90%+ of context quotes by excluding /^>/.

I'm thinking of something "stateful" in which I scan lines until

(1) I hit a line that starts with '>', in which case I discard
everything I have so far (attribution). Then I keep going, ignoring
/^>/ lines (quoted) but keeping other lines until I hit the cutline or
the end.

(2) I hit the cutline or the end, in which case I keep everything so
far (an OP).
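
The scan described above might look something like the following. It is a sketch only: new_content is a hypothetical name, the input is assumed to be a list of chomped body lines, and (as expected) a top-posted reply loses its new text, because everything before the first quoted line is treated as attribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the unquoted, non-sig, non-attribution lines.
sub new_content {
    my @lines = @_;
    my @kept;
    my $seen_quote = 0;
    for my $line (@lines) {
        last if $line =~ /^-- $/;    # cutline (or end of input): stop here
        if ($line =~ /^>/) {
            # First quoted line: whatever was stored so far was attribution
            @kept = () unless $seen_quote;
            $seen_quote = 1;
            next;                    # quoted lines are never kept
        }
        push @kept, $line;
    }
    return @kept;                    # with no quotes at all, this is the whole OP
}

# Invented reply: wrapped attribution, two quotes, two new lines, a sig
my @reply = (
    'On some day or other,',
    'Art Merkel wrote:',
    '> first quoted line',
    'first reply line',
    '> second quoted line',
    'second reply line',
    '-- ',
    'witty sig',
);

print "$_\n" for new_content(@reply);
```

Because the stored lines are thrown away only at the first quoted line, a wrapped attribution is handled without needing a pattern for it.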

> A usenet sig (if it's properly configured) follows a cutline which is
> two dashes and a space. It's easy to identify such a cutline and
> ignore everything which follows. But many posters don't use a proper
> cutline.

No way to deal with top-posting, is there?
 
Adam Funk

> You won't be able to do this 100% of the time because the behavior of
> replies is different (and can be customized) in different newsreaders.
> Usenet posts are plain text, and lack the context tagging of XML, etc.
> But you can probably get pretty close to what you want.

> You can probably exclude 90%+ of attribution lines by excluding
> /wrote:$/ (but it won't work for Dr.Ruud's posts, etc). Of course,
> that assumes English-language newsgroups. Some folks try to be cute
> with attribution lines like:
> When Art Merkel finally sobered up, he blundered:
> Nuthin you can do about attribution lines like that, unless you
> hard-code distinctive strings for prolific posters.

Here's something I've tinkered with, which assumes that either the
body is all original (no m/^>/ lines) or that all unquoted lines
before the first quoted one are attribution lines (I think this is
almost always the case for inline/bottom-posting).

Comments, suggestions?

Of course it doesn't handle top-posting!


##################################################
#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Std;
use
my ($filename, $in_art, $out_art, $out_filename);

while (@ARGV) {
$filename = shift(@ARGV);
$in_art =
print("*****\n$filename\n");

process_body($in_art->body());
}


sub process_body {
    my @input   = @_;
    my $op      = 1;    # true iff this is an original post (no quoting)
    my $not_sig = 1;    # true until the sig cutline has been seen
    my $line;

    # Look for a quoted line before the sig cutline (or the end)
    foreach $line (@input) {
        if ($line =~ /^>/) {
            $op = 0;
            last;
        }
        elsif ($line =~ /^-- $/) {
            last;
        }
    }

    if ($op) {
        print("original\n");
    }
    else {
        print("quoting\n");
    }

    # Unquoted lines before the first quoted line are attribution.
    # Stop before the quoted line so the next loop labels it " q ".
    if (! $op) {
        while (@input && $input[0] !~ /^>/) {
            $line = shift(@input);
            print(" a $line\n");    # attribution
        }
    }

    while (@input && $not_sig) {
        $line = shift(@input);
        if ($line =~ /^-- $/) {
            $not_sig = 0;
            print(" - ");    # sig separator
        }
        elsif ($line !~ /^>/) {
            print(" n ");    # new content
        }
        else {
            print(" q ");    # quoted
        }
        print($line, "\n");
    }

    # Everything after the cutline is the sig
    while (@input) {
        $line = shift(@input);
        print(" s $line\n");    # sig
    }
}
 
