Matching spaces at start of line

P

Paul Jones

Perhaps someone can help me with a little perl problem.

I want to take an message and write it out to a file, but I only
want specific header information. I have included what I have
got below (ripped from a mail2news script on the web), however
the comments header is often split over multiple lines (usually
4) and I want to include them all. The 2-4th lines of the header
all begin with spaces (usually 6 - perhaps its a tab).

The program does exactly what I want of it, except for missing
the lines beginning with a space.




open (INEWS, "| $program $options") ||
die "$program: can't run $news_poster_program\n";

# header loop
while (<STDIN>) {
last if /^$/;

s/(?i)^Date/Date/;
s/(?i)^From/From/;
s/(?i)^Subject/Subject/;
s/(?i)^Newsgroups/Newsgroups/;
s/(?i)^Comments/Comments/;

print INEWS

if /^(Date|From|Subject|Newsgroups|Comments):/i;
$saw_newsgroup |= ( $+ eq 'Newsgroups' );

}

die "$program: didn't get newsgroup from headers\n"
unless $saw_newsgroup;

print INEWS "\n";

print INEWS while <STDIN>; # gobble rest of message
 
D

Dr.Ruud

Paul Jones schreef:
Perhaps someone can help me with a little perl problem.

I want to take an message and write it out to a file, but I only
want specific header information. I have included what I have
got below (ripped from a mail2news script on the web), however
the comments header is often split over multiple lines (usually
4) and I want to include them all. The 2-4th lines of the header
all begin with spaces (usually 6 - perhaps its a tab).

The program does exactly what I want of it, except for missing
the lines beginning with a space.


use strict;
use warnings;
open (INEWS, "| $program $options") ||
die "$program: can't run $news_poster_program\n";

# header loop
while (<STDIN>) {
last if /^$/;

You process line-by-line, but some header fields are folded. Reading the
whole header at once will make things simpler, you won't even need the
while-loop. See perldoc perlvar, specifically $/, the input record
separator.

local $/ = '';

When the header is in $_, you can use 's/\n\s+/ /g' to unfold the folded
header fields.
(err, that \s needs to be [[:blank:]])

After that you can remove the header fields that you don't want:

s/\n(?!(Date|From|Subject|Newsgroups|Comments):).*//gi

s/(?i)^Date/Date/;
s/(?i)^From/From/;
s/(?i)^Subject/Subject/;
s/(?i)^Newsgroups/Newsgroups/;
s/(?i)^Comments/Comments/;

It is better to include the ':' too:

s/^date:/Date:/mi;


Basically:

#!/usr/bin/perl

use strict;
use warnings;

{ local ($/, $\) = ('', "\n\n");

# read the header
$_ = <STDIN>;

# unfold header fields
s/\n[[:blank:]]+/ /g;

# remove unwanted header fields
s/\n(?!(Date|From|Subject|etc.):).*//gi;

# standardize header field names
s/^date:/Date:/mi;
# etc.

print;

# read and print the body
undef $/;
undef $\;
$_ = <STDIN>;
print;
}
 
T

Tad McClellan

Paul Jones said:
the comments header is often split over multiple lines (usually
4) and I want to include them all. The 2-4th lines of the header
all begin with spaces (usually 6 - perhaps its a tab).

print INEWS

if /^(Date|From|Subject|Newsgroups|Comments):/i;


if /^(Date|From|Subject|Newsgroups|Comments):/i or /^\s/;
 
J

John W. Krahn

Tad said:
if /^(Date|From|Subject|Newsgroups|Comments):/i or /^\s/;

That is going to print from every header that is folded, not just the Comments
header. :-(


John
 
J

John W. Krahn

Paul said:
Perhaps someone can help me with a little perl problem.

I want to take an message and write it out to a file, but I only
want specific header information. I have included what I have
got below (ripped from a mail2news script on the web), however
the comments header is often split over multiple lines (usually
4) and I want to include them all. The 2-4th lines of the header
all begin with spaces (usually 6 - perhaps its a tab).

The program does exactly what I want of it, except for missing
the lines beginning with a space.



open (INEWS, "| $program $options") ||
die "$program: can't run $news_poster_program\n";

# header loop
while (<STDIN>) {
last if /^$/;

s/(?i)^Date/Date/;
s/(?i)^From/From/;
s/(?i)^Subject/Subject/;
s/(?i)^Newsgroups/Newsgroups/;
s/(?i)^Comments/Comments/;

print INEWS

if /^(Date|From|Subject|Newsgroups|Comments):/i;
$saw_newsgroup |= ( $+ eq 'Newsgroups' );

}

die "$program: didn't get newsgroup from headers\n"
unless $saw_newsgroup;

print INEWS "\n";

print INEWS while <STDIN>; # gobble rest of message

Here is one way to do it:


open INEWS, '|-', split( ' ', "$program $options" )
or die "$program: can't run $news_poster_program: $!";

my $saw_newsgroup;
while ( <STDIN> ) {
last if /^$/;
if ( s/^(comments:)/\u\L$1/i ) {
my $line = <>;
if ( $line =~ /^\s/ ) {
$_ .= $line;
}
else {
print INEWS;
$_ = $line;
}
redo;
}
elsif ( s/^(newsgroups:)/\u\L$1/i ) {
$saw_newsgroup++;
}
elsif ( ! s/^(date|from|subject):/\u\L$1:/i ) {
next;
}
print INEWS;
}

die "$program: didn't get newsgroup from headers\n"
unless $saw_newsgroup;

print INEWS "\n";
print INEWS while <STDIN>;

close INEWS
or warn $! ? "Error closing $program: $!"
: "Exit status $? from $program";

__END__



John
 
A

Anno Siegel

John W. Krahn said:
That is going to print from every header that is folded, not just the Comments
header. :-(

I think it should, for those headers that are to be printed at all. I
know that OP singled out the Comments header, but that's probably just
because it's the only one that *has* continuation lines with any regularity.

I read the problem as "Print all valid headers ( Date, From, ...) and
their continuation lines, suppress all others, including their continuation
lines". Could be done like this:

my $saw_newsgroup;
my $valid_header;
while (<DATA>) {
last if /^$/;

if ( /^\S/ ) {
$valid_header =
s{^(Date|From|Subject|Newsgroups|Comments):}
{\u\L$1:}i;
$saw_newsgroup ||= $valid_header && $1 eq 'Newsgroups';
}
print if $valid_header;
}
die "didn't get newsgroup from headers\n" unless $saw_newsgroup;

print "\n";
print while <DATA>; # gobble rest of message

Besides adding the logic to control $valid_header, I have used the
s/// operation that normalizes header spelling to determine if
we have a valid header. This way, the list of valid headers appears
only once in the code, while the original code had it twice. A
maintainer will appreciate that.

Anno
 
P

Paul Jones

Paul Jones schreef:



use strict;
use warnings;


You process line-by-line, but some header fields are folded. Reading the
whole header at once will make things simpler, you won't even need the
while-loop. See perldoc perlvar, specifically $/, the input record
separator.


<snip great advice>


Basically:

Thanks very much!

Unfortunately I chopped my script down a bit to post it on here,
thinking that as some of it is in essence just repeated, or so I
thought, I needent bother you with it all (it's really about 2-
3x the length of what I posted). Unfortunately my first attempt
at expanding what you have given me failed to work fairly
miserably!

I will try a little longer at getting it to work, but if I still
have problems I'll post the whole thing and maybe you can point
out what I'm doing wrong. I'm very new to perl and all I know is
what I've managed to get from the web or from other peoples
annotated scripts.
 
P

Paul Jones

I think it should, for those headers that are to be printed at all. I
know that OP singled out the Comments header, but that's probably just
because it's the only one that *has* continuation lines with any regularity.

I read the problem as "Print all valid headers ( Date, From, ...) and
their continuation lines, suppress all others, including their continuation
lines". Could be done like this:

Perhaps a better way of approaching the problem would have been
to print all headers except a list that should be excluded.

Anyway, maybe I should have just listed the entire requirements
of what I wanted, rather than show you what I've got and ask how
to do it better. So here we go:

The message is routed via Exim4 in it's transport phase.

mail2news_transport:
debug_print = "T: mail2news for $local_part@$domain"
driver = pipe
command = /usr/local/bin/mail2news.pl
user = mail

It's then picked up by the perl script.

The perl script starts with:



$news_poster_program = "/usr/bin/rnews";
$news_poster_options = "-r localhost";

open (INEWS, "| $news_poster_program $news_poster_options") ||
die "$program: can't run $news_poster_program\n";



Then it goes on to work on the headers. I would like to exclude:

Control:
Supersedes:
Authorised:
Received:

(Note Received: can go over multiple lines)

Certain headers need to be supplied or it fails:

From:
Newsgroups:
Date:


Some are required but can be created if absent:

Subject:
Message-ID:
Organization:

(I think I've got that bit sorted)

($sec,$min,$hour,$mday,$mon,$year)=localtime(time);
$madeupid = "\<$year$mon$mday.$hour$min$sec.$$\@myhost.com\>";

print INEWS "Subject: Untitled\n" unless $saw_subject;
printf INEWS "Message-ID: %s\n", $madeupid unless $saw_msgid;
print INEWS "Organization: Unknown\n" unless $saw_organization;


Some must be created (removing any that already exist):

print INEWS "Path: posted\n";


Others should be printed, unless excluded.

One final thing - it would be nice to grab the IP address from
the first Received header to add it to an NNTP-Posting-Host:
header.

This is basically a mail2news posting perl script. The ones
already out there dind't seem to work as well as I liked, so I
tried fiddling with them to improve them. I've got it working,
however it could be (much) better (because I know next to
nothing about PERL).

Anyway, thanks for the helps so far. I'll keep tinkering. But I
would be happy to receive any more tips!
 
J

John W. Krahn

Anno said:
I think it should, for those headers that are to be printed at all. I
know that OP singled out the Comments header, but that's probably just
because it's the only one that *has* continuation lines with any regularity.

I read the problem as "Print all valid headers ( Date, From, ...) and
their continuation lines, suppress all others, including their continuation
lines". Could be done like this:

my $saw_newsgroup;
my $valid_header;
while (<DATA>) {
last if /^$/;

if ( /^\S/ ) {
$valid_header =
s{^(Date|From|Subject|Newsgroups|Comments):}
{\u\L$1:}i;
$saw_newsgroup ||= $valid_header && $1 eq 'Newsgroups';

If the original header is not exactly 'Newsgroups', for example 'newsgroups'
or 'NewsGroups', then that expression will fail. Perhaps:

$saw_newsgroup ||= $valid_header && lc( $1 ) eq 'newsgroups';
}
print if $valid_header;
}
die "didn't get newsgroup from headers\n" unless $saw_newsgroup;

print "\n";
print while <DATA>; # gobble rest of message

Besides adding the logic to control $valid_header, I have used the
s/// operation that normalizes header spelling to determine if
we have a valid header. This way, the list of valid headers appears
only once in the code, while the original code had it twice. A
maintainer will appreciate that.


John
 
A

Anno Siegel

John W. Krahn said:
Anno Siegel wrote:
[...]
$valid_header =
s{^(Date|From|Subject|Newsgroups|Comments):}
{\u\L$1:}i;
$saw_newsgroup ||= $valid_header && $1 eq 'Newsgroups';

If the original header is not exactly 'Newsgroups', for example 'newsgroups'
or 'NewsGroups', then that expression will fail. Perhaps:

$saw_newsgroup ||= $valid_header && lc( $1 ) eq 'newsgroups';
}
print if $valid_header;
}
die "didn't get newsgroup from headers\n" unless $saw_newsgroup;

You are right, I noticed that bug myself after posting. I thought of the
same fix too.

It is remarkable (and comforting) how carefully code is read that is
posted here.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,023
Latest member
websitedesig25

Latest Threads

Top