On Tue, 30 Dec 2008 12:17:44 +0100, Mirco Wahab wrote:
[snipped and reordered, for thematic reasons]
my hack on your problem:
use strict;
use warnings;
use LWP::Simple;
# load the complete content of the url in question
# via LWP::Simple::get(...)
my $t = get 'http://www.alfrankensense.com/al_franken_quotes.html';
# inspect the web site and look at what "marker"
# your stuff usually starts, in your case - it's the tag:
# <center>Al Franken Quotes</center>
my @quotes; # array, where the quotes are to be collected
# *If* we got there:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression
# apply this expression to the text
while($t=~ /$q/g) { # /g in scalar context (look it up)
push @quotes, [$1, $2]; # save found quote on array
} # quote in $1, source in $2
}
print "total: " . scalar @quotes . " quotes found\n";
for my $q (@quotes) { # now show what quotes we found
print_nice($q->[0], $q->[1]) # and format them however you want
}
# that's it
# we need to provide our special formatting subroutine
sub print_nice {
my($q, $s) = @_; # copy the arguments into named variables
$q =~ s/\s+/ /g; # quote: collapse runs of whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" . '-'x40 ."\n" # new line + 40 x '-' + new line
. "- $s\n\n" # '-' + quote source + double \n
}
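To try the pattern without hitting the network, it can be run against a
small inline string. This is only a sketch: the sample HTML and the sub
name extract_quotes are made up for illustration, and they assume the page
layout described in the comments above (a tab, the quoted text, then "- "
and the source up to the next tag).

```perl
use strict;
use warnings;

# Hypothetical sample mimicking the layout the pattern expects.
my $html = "<center>Al Franken Quotes</center>\n"
         . "\t\"First sample\nquote.\"<br>\n- Sample Source One<br>\n"
         . "\t\"Second sample quote.\"<br>\n- Sample Source Two<br>\n";

# Return a list of [quote, source] pairs found in the given text.
sub extract_quotes {
    my ($text) = @_;
    my $q = qr{ \t            # a literal tab precedes each quote
                "([^"]+)"     # the quote itself, captured as group 1
                .+?           # skip ahead lazily ...
                \-\s+         # ... to a dash followed by whitespace
                ([^<]+)       # the source, up to the next tag, as group 2
              }sx;
    my @found;
    while ($text =~ /$q/g) {  # /g in scalar context walks match by match
        push @found, [$1, $2];
    }
    return @found;
}

my @quotes = extract_quotes($html);
print scalar(@quotes), " quotes found\n";
print "$_->[0] => $_->[1]\n" for @quotes;
```

Because [^"]+ also matches newlines, a quote that wraps across lines is
still captured whole; the whitespace cleanup happens later in print_nice.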
We're getting real close here.
use strict;
use warnings;
use LWP::Simple;
# load the complete content of the url in question
# via LWP::Simple::get(...)
my $t = get 'http://www.alfrankensense.com/al_franken_quotes.html';
# inspect the web site and look at what "marker"
# your stuff usually starts, in your case - it's the tag:
# <center>Al Franken Quotes</center>
my @quotes; # array, where the quotes are to be collected
# *If* we got there:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression
# apply this expression to the text
while($t=~ /$q/g) { # /g in scalar context (look it up)
push @quotes, [$1, $2]; # save found quote on array
} # quote in $1, source in $2
}
print "total: " . scalar @quotes . " quotes found\n";
for my $q (@quotes) { # now show what quotes we found
print_nice($q->[0], $q->[1]) # and format them however you want
}
# that's it
# we need to provide our special formatting subroutine
sub print_nice {
my($q, $s) = @_; # copy the arguments into named variables
$q =~ s/\s+/ /g; # quote: collapse runs of whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" # new line
. "~~ $s\n" # '~~ ' + quote source + new line
. "% \n" # a percent sign (plus a trailing space) after each quote
}
# perl wahab7.pl >"\Program Files\40tude Dialog\sigs\frank1.txt"
# perl wahab7.pl >frank1.txt
The output is really close. I can't see any difference between the format
of the two following files
%
A dictatorship would be a heck of a lot easier, there's no question about
it.
George W. Bush
%
After the chaos and carnage of September 11th, it is not enough to serve
our enemies with legal papers.
George W. Bush
%
America is a friend to the people of Iraq. Our demands are directed only at
the regime that enslaves them and threatens us. When these demands are met,
the first and greatest benefit will come to Iraqi men, women and children.
George W. Bush
%
America is a Nation with a mission - and that mission comes from our most
basic beliefs. We have no desire to dominate, no ambitions of empire. Our
aim is a democratic peace - a peace founded upon the dignity and rights of
every man and woman.
George W. Bush
and
%
The biases the media has are much bigger than conservative or liberal.
They're about getting ratings, about making money, about doing stories that
are easy to cover.
~~ Al Franken,
%
[G. W. Bush's] pro-air pollution Clear Skies Initiative is designed to
clear the skies of birds.
~~ Al Franken,
%
And just like in 1984, where the enemy is switched from Eurasia to
Eastasia, Bush switched our enemy from al Qaeda to Iraq. Bush's War on
Terror is a war against whomever Bush wants to be at war with.
~~ Al Franken,
%
Mistakes are a part of being human. Appreciate your mistakes for what they
are: precious life lessons that can only be learned the hard way. Unless
it's a fatal mistake, which, at least, others can learn from.
~~ Al Franken,
%
, but right now Dialog doesn't treat these quotes as delimited by a percent
sign. Also, we need to remove the comma after "Al Franken" where it doesn't
belong. I think I can manage that tonight.
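One possible way to close the gap, sketched here under assumptions: the
helper name format_entry is hypothetical, and it is only a guess that
Dialog wants the "%" alone on its line (no trailing space), placed before
each quote, with no "~~ " in front of the source, as in the first file.

```perl
use strict;
use warnings;

# Hypothetical helper: build one entry in the layout of the first file.
sub format_entry {
    my ($q, $s) = @_;
    for ($q, $s) {           # $_ aliases $q, then $s
        s/<[^>]+>//g;        # drop HTML tags
        s/\s+/ /g;           # collapse whitespace runs to one space
        s/^\s+|\s+$//g;      # trim leading and trailing whitespace
    }
    $s =~ s/,$//;            # drop a single trailing comma from the source
    return "%\n$q\n$s\n";    # bare percent sign, quote, source
}

print format_entry("The biases the media has\nare much bigger.",
                   "Al Franken, ");
```

Returning a string instead of printing directly makes it easy to compare
the result byte for byte against a file that Dialog already accepts.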
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
I don't understand what these statements do.
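For reference, a small sketch of what those two substitutions do, using a
made-up input string (the sub name clean_source and the sample text are
only for illustration):

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two statements in question.
sub clean_source {
    my ($s) = @_;
    $s =~ s/\s+/ /g;      # each run of spaces, tabs, newlines -> one space
    $s =~ s/<[^>]+>//g;   # each <...> tag is deleted; text between stays
    return $s;
}

# Made-up source with a line break, extra spaces, and a tag.
print clean_source("Al   Franken,\n   <i>Some Title</i>"), "\n";
# prints: Al Franken, Some Title
```

The /g modifier makes each substitution apply to every match in the
string rather than only the first one.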
I looked into your HTML source and came up with something
that does the job somehow. Please try to learn some
basics of regular expressions, e.g. from here:
http://oreilly.com/catalog/9781565922570/
Thanks for your response, M., I'll read up tonight.
--
George
The terrorists and their supporters declared war on the United States - and
war is what they got.
George W. Bush
Picture of the Day
http://apod.nasa.gov/apod/