searching for franken

George · Dec 29, 2008

I've written perl programs now for about a year and a half, and they have
usually focused on usenet. In some sense, this one does as well, in that
I'm developing material for my next 'nym shift, but the input from this is
the internet instead, probably an url.

I've never used perl to do this before, so I don't know where to start,
except with

# shebang windows meaningless

use warnings;
use strict;

I would like to test whether

http://www.co-array.org/

contains the following words:

distributed memory
Numerid Ried
OpenMP

Thanks for your comment, and thank you Santa for sending the cowgirls home
with George.
--
George

This way of life is worth defending.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

Mirco Wahab · Dec 29, 2008

George said:
I've written perl programs now for about a year and a half, and they have
usually focused on usenet. In some sense, this one does as well, in that
I'm developing material for my next 'nym shift, but the input from this is
the internet instead, probably an url.
I've never used perl to do this before, so I don't know where to start,
except with
I would like to test whether
http://www.co-array.org/
contains the following words:
distributed memory
Numerid Ried
OpenMP

There is a module called LWP::Simple
http://search.cpan.org/~gaas/libwww-perl-5.822/lib/LWP/Simple.pm

which does exactly that, example:

...

use LWP::Simple;

my $url = 'www.co-array.org';
my $html = get 'http://' . $url;

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier
print '...' . substr($html, pos($html)-20, 40) . "...\n"
}

...
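One caveat with the substr call above: if a match sits within the first 20 characters, pos($html)-20 goes negative and substr then counts from the end of the string. A minimal sketch of a safer context window follows, using the match offsets in @- and @+. The sub name matches_in_context and the sample text are made up for illustration; no page is fetched here.

```perl
use strict;
use warnings;

# Return each match of $re in $text with up to $pad characters of
# context on either side. $-[0] and $+[0] are the start and end offsets
# of the last match; the start is clamped at 0 so substr never receives
# a negative offset.
sub matches_in_context {
    my ($text, $re, $pad) = @_;
    my @hits;
    while ($text =~ /$re/g) {
        my $start = $-[0] - $pad;
        $start = 0 if $start < 0;
        my $len = ($+[0] + $pad) - $start;
        push @hits, substr($text, $start, $len);
    }
    return @hits;
}

# Hypothetical sample text standing in for the fetched page.
my $sample = "OpenMP Fortran runs on distributed memory machines too.";
my @hits = matches_in_context($sample, qr/distributed\s+memory|OpenMP/, 10);
print "...$_...\n" for @hits;
```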

If you need to pass headers and cookies to the
site, please study the other libwww functions
(http://search.cpan.org/~gaas/libwww-perl-5.822/lib/Net/HTTP.pm)

or copy/paste the complete http query from your browser
and send it via IO::Socket::INET.
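A rough sketch of what sending the query by hand could look like, assuming a plain HTTP/1.0 GET on port 80. The sub names build_request and fetch_raw and the header values are illustrative assumptions; a real browser sends many more headers. The network is only touched when a host name is given on the command line.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a minimal HTTP/1.0 GET request by hand. The two trailing empty
# strings produce the blank line that ends the header section.
sub build_request {
    my ($host, $path) = @_;
    return join "\r\n",
        "GET $path HTTP/1.0",
        "Host: $host",
        "User-Agent: Mozilla/5.0",
        "",
        "";
}

# Send the request over a plain TCP socket and return the raw response
# (status line, headers and body together).
sub fetch_raw {
    my ($host, $path) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
    ) or die "connect to $host failed: $@";
    print $sock build_request($host, $path);
    local $/;                 # slurp mode: read until the server closes
    my $response = <$sock>;
    close $sock;
    return $response;
}

# Only connect when a host is given, e.g.: perl raw.pl www.co-array.org
print fetch_raw($ARGV[0], '/') if @ARGV;
```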

Regards

M.

Randal L. Schwartz · Dec 29, 2008

George> I would like to test whether

George> http://www.co-array.org/

George> contains the following words:

George> distributed memory
George> Numerid Ried
George> OpenMP

Go to google. Enter:

site:co-array.org ("distributed memory" OR "Numerid Ried" OR OpenMP)

and see the results.

This is not a Perl problem.

Jürgen Exner · Dec 29, 2008

["searching for franken"]

You can find Franken in the northern part of Bavaria.

I would like to test whether

http://www.co-array.org/

contains the following words:

distributed memory
Numerid Ried
OpenMP

To check if A is an anagram of B you would treat both strings as
multi-sets of characters and check if they are equal.
To check for "contains" simply check for subset instead of equality.

jue

Tad J McClellan · Dec 29, 2008

Mirco Wahab said:
George wrote:

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier

Note that the /m modifier does absolutely nothing for the pattern
being used, and so is not needed at all.

You probably meant the /s modifier instead?

Tad J McClellan · Dec 29, 2008

George said:
I've written perl programs now for about a year and a half, and they have
usually focused on usenet. In some sense, this one does as well, in that
I'm developing material for my next 'nym shift, but the input from this is
the internet instead, probably an url.

Errr, usenet is on the "internet" too.

I expect that you meant that the input is to be from a WWW page.

I've never used perl to do this before, so I don't know where to start,

Start with the Perl Frequently Asked Questions.

Since you are interested in getting an HTML page:

perldoc -q HTML

might hand you the answer in response to this FAQ:

How do I fetch an HTML file?

I would like to test whether

http://www.co-array.org/

contains the following words:

distributed memory
Numerid Ried
OpenMP

--------------------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;

$_ = get 'http://www.co-array.org/';

print "yes\n" if /distributed/ and /memory/
and /Numerid/ and /Ried/ and /OpenMP/;
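Note that the test above only says "yes" when every word occurs somewhere on the page, in any order. If the aim is to see which phrases are absent, something along these lines might do. It is a sketch only: missing_phrases is a made-up name, and the sample string stands in for the fetched page.

```perl
use strict;
use warnings;

# Return the phrases from @phrases that do not occur in $text.
# Each word is passed through quotemeta so regex metacharacters are
# taken literally; the words are joined with \s+ so a space in a phrase
# may match any run of whitespace, including a line break.
sub missing_phrases {
    my ($text, @phrases) = @_;
    my @missing;
    for my $phrase (@phrases) {
        my $re = join '\s+', map { quotemeta($_) } split ' ', $phrase;
        push @missing, $phrase unless $text =~ /$re/;
    }
    return @missing;
}

# Hypothetical page text; in practice this would come from get().
my $page = "Co-array Fortran targets distributed\nmemory machines; see also OpenMP.";
my @missing = missing_phrases($page, 'distributed memory', 'Numerid Ried', 'OpenMP');
print @missing ? "missing: @missing\n" : "yes\n";
```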

Randal L. Schwartz · Dec 29, 2008

George> I did try what you suggest above. Was I to paste in the above where I
George> would otherwise do a keyword search?

Like this:

http://letmegooglethatforyou.com/?q...tributed+memory"+OR+"Numerid+Ried"+OR+OpenMP)

Watch and learn.

/me sighs

George · Dec 29, 2008

There is a module called LWP::Simple
http://search.cpan.org/~gaas/libwww-perl-5.822/lib/LWP/Simple.pm

which does exactly that, example:

...

use LWP::Simple;

my $url = 'www.co-array.org';
my $html = get 'http://' . $url;

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier
print '...' . substr($html, pos($html)-20, 40) . "...\n"
}

...

If you need to pass headers and cookies to the
site, please study the other libwww functions
(http://search.cpan.org/~gaas/libwww-perl-5.822/lib/Net/HTTP.pm)

# shebang doesn't work on windows

use strict;
use warnings;

use Net::HTTP;
use LWP::Simple;

my $url = 'www.co-array.org';
my $html = get 'http://' . $url;

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier
print '...' . substr($html, pos($html)-20, 40) . "...\n"
}

my $s = Net::HTTP->new(Host => "www.co-array.org") || die $@;
$s->write_request(GET => "/", 'User-Agent' => "Mozilla/5.0");
my($code, $mess, %h) = $s->read_response_headers;

while (1) {
my $buf;
my $n = $s->read_entity_body($buf, 1024);
die "read failed: $!" unless defined $n;
last unless $n;
print $buf;
}

# perl wahab3.pl

# end script begin abridged output

C:\MinGW\source>perl wahab3.pl
....n distributed memory machines but also o...
....ranslate into OpenMP Fortran</A>
...
....ay Fortran to OpenMP Fortran translator<...
....uivalent SPMD OpenMP Fortran 90/95 progr...
....w.openmp.org">OpenMP Fortran</A> compile...
....tectures
than OpenMP Fortran, programs s...
....anslated into OpenMP as part of the comp...
....ntrinsics for OpenMP is bundled with the...
....anslates into OpenMP Fortran.
<L...
....mp99.ps">SPMD OpenMP vs MPI for Ocean Mo...
....y Fortran and OpenMP Fortran for SPMD Pr...
<HTML>
<HEAD>
<TITLE>Co-Array Fortran</TITLE>
<LINK REL="shortcut icon" HREF="favicon.ico" />
</HEAD>

<BODY BACKGROUND="" BGCOLOR="#ffff99" TEXT="#000000" LINK="#0000ff"
VLINK="#8000
00" ALINK="#ff0000">

<HR WIDTH="30%" ALIGN=CENTER>
<CENTER><B><H1>Co-Array Fortran</H1></B></CENTER>
<HR WIDTH="30%" ALIGN=CENTER>

<P>
Co-array Fortran is a small extension to
Fortran 95.

# end abridged output

Wow, Mirco, with a little guidance, a person can go from square one to
cruising speed *so quickly*. I think this gives me the tools I need to
move my token forward. Thanks so much.

or copy/paste the complete http query from your browser
and send it via IO::Socket::INET.

Regards

M.

Do you mean that I would use the mouse or imitate the events?
--
George

Great tragedy has come to us, and we are meeting it with the best that is
in our country, with courage and concern for others because this is
America. This is who we are.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

George · Dec 29, 2008

George> I did try what you suggest above. Was I to paste in the above where I
George> would otherwise do a keyword search?

Like this:

http://letmegooglethatforyou.com/?q...tributed+memory"+OR+"Numerid+Ried"+OR+OpenMP)

Watch and learn.

/me sighs

That's hilarious. I'll use that in my other discussion forums where people
are willfully ignorant of facts.

http://i429.photobucket.com/albums/qq15/george196884/fortran40.jpg

It correctly prompts for a respelling of Reid to reflect the person who has
been most identified with making co-arrays part of standard fortran.

Ultimately, it's a google keyword search, and that's not what I'm after.
--
George

Hundreds of thousands of American servicemen and women are deployed across
the world in the war on terror. By bringing hope to the oppressed, and
delivering justice to the violent, they are making America more secure.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

sln · Dec 29, 2008

George> I did try what you suggest above. Was I to paste in the above where I
George> would otherwise do a keyword search?

Like this:

http://letmegooglethatforyou.com/?q...tributed+memory"+OR+"Numerid+Ried"+OR+OpenMP)

Watch and learn.

/me sighs

"Enable javascript to use LMGTFY."

Doesn't work.

sln

sln · Dec 29, 2008

# shebang doesn't work on windows

use strict;
use warnings;

use Net::HTTP;
use LWP::Simple;

my $url = 'www.co-array.org';
my $html = get 'http://' . $url;

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier
print '...' . substr($html, pos($html)-20, 40) . "...\n"
}

my $s = Net::HTTP->new(Host => "www.co-array.org") || die $@;
$s->write_request(GET => "/", 'User-Agent' => "Mozilla/5.0");
my($code, $mess, %h) = $s->read_response_headers;

while (1) {
my $buf;
my $n = $s->read_entity_body($buf, 1024);
die "read failed: $!" unless defined $n;
last unless $n;
print $buf;
}

# perl wahab3.pl

# end script begin abridged output

C:\MinGW\source>perl wahab3.pl
...n distributed memory machines but also o...
...ranslate into OpenMP Fortran</A>
...
...ay Fortran to OpenMP Fortran translator<...
...uivalent SPMD OpenMP Fortran 90/95 progr...
...w.openmp.org">OpenMP Fortran</A> compile...
...tectures
than OpenMP Fortran, programs s...
...anslated into OpenMP as part of the comp...
...ntrinsics for OpenMP is bundled with the...
...anslates into OpenMP Fortran.
<L...
...mp99.ps">SPMD OpenMP vs MPI for Ocean Mo...
...y Fortran and OpenMP Fortran for SPMD Pr...
<HTML>
<HEAD>
<TITLE>Co-Array Fortran</TITLE>
<LINK REL="shortcut icon" HREF="favicon.ico" />
</HEAD>

<BODY BACKGROUND="" BGCOLOR="#ffff99" TEXT="#000000" LINK="#0000ff"
VLINK="#8000
00" ALINK="#ff0000">

<HR WIDTH="30%" ALIGN=CENTER>
<CENTER><B><H1>Co-Array Fortran</H1></B></CENTER>
<HR WIDTH="30%" ALIGN=CENTER>

<P>
Co-array Fortran is a small extension to
Fortran 95.

# end abridged output

Wow, Mirco, with a little guidance, a person can go from square one to
cruising speed *so quickly*. I think this gives me the tools I need to
move my token forward. Thanks so much.

Do you mean that I would use the mouse or imitate the events?

This is no more than grabbing HTML and searching for patterns in it with
a regexp. The result is neither reliable nor formattable. You will still
have to parse it before searching.
No parsing, no reliable content. It's still a mixed bag of junk.

sln

George · Dec 30, 2008

This is no more than grabbing HTML and searching for patterns in it with
a regexp. The result is neither reliable nor formattable. You will still
have to parse it before searching.
No parsing, no reliable content. It's still a mixed bag of junk.

I tried to address this issue with the following:

# shebang doesn't work on windows

use strict;
use warnings;
use LWP::Simple;
use HTML::Parser;
use HTML::FormatText;
my ($html, $ascii);
$html = get("http://www.co-array.com/");
defined $html
or die "Can't fetch HTML from http://www.perl.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;

# perl wahab4.pl

This yields:

C:\MinGW\source>perl wahab4.pl
Can't locate HTML/FormatText.pm in @INC (@INC contains: C:/Perl/site/lib
C:/Perl
/lib .) at wahab4.pl line 7.
BEGIN failed--compilation aborted at wahab4.pl line 7.

C:\MinGW\source>

Unfortunately, I get no matches when I search for Format in the PPM for
activestate, which sounds unlikely but clearly indicates that I'm over my
head.
--
George

This way of life is worth defending.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

George · Dec 30, 2008

Note that the /m modifier does absolutely nothing for the pattern
being used, and so is not needed at all.

You probably meant the /s modifier instead?

I think this is a significant issue and is where I'm stumbling right now:

# shebang doesn't work on windows

use strict;
use warnings;

use Net::HTTP;
use LWP::Simple;

my $url = 'www.alfrankensense.com/al_franken_quotes.html';
my $html = get 'http://' . $url;

#my $what = qr'*?-Al*?';
my $what = qr'distributed.*?memory|Al.*?Franken|OpenMP';

while($html =~ /$what/mg) { # note the /m modifier
print '...' . substr($html, pos($html)-20, 40) . "...\n"
}

# perl wahab5.pl

I've tried /m /s /mg and struck out.

This is the page that I want to work on. Ultimately, I want the quotes to
be of the form where dialog can use them as randomquotes, which is:

I believe the most solemn duty of the American president is to protect the
American people. If America shows uncertainty and weakness in this decade,
the world will drift toward tragedy. This will not happen on my watch.
George W. Bush
%
I can hear you, the rest of the world can hear you and the people who
knocked these buildings down will hear all of us soon.
George W. Bush
%
I have a different vision of leadership. A leadership is someone who brings
people together.
George W. Bush
%
I just want you to know that, when we talk about war, we're really talking
about peace.
George W. Bush
--
George

The thing that's wrong with the French is that they don't have a word for
entrepreneur.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

Dr.Ruud · Dec 30, 2008

George said:
# shebang doesn't work on windows

Stop putting that in your scripts. Start reading the documentation.

In short: the path or even the name of the binary don't matter, but the
options do.

Unless of course you would bind .pl to a shebang.com that would just do
what you once expected.

Mirco Wahab · Dec 30, 2008

It prevents the dot from matching across line
boundaries, as you surely know.

not really

I think this is a significant issue and is where I'm stumbling right now:

George, what you probably intend to do is, as anybody
noted, "web scraping" which has to be applied (in your
case) to some ill structured html source. What I'd use
here is a kind of 'quick and dirty' regular expression
search.

I looked into your html source and conceived something
that does the job somehow. Please try to learn some
basics of regular expressions, eg.: from here:
http://oreilly.com/catalog/9781565922570/

my hack on your problem:

use strict;
use warnings;
use LWP::Simple;

# load the complete content of the url in question
# via LWP::Simple::get(...)
my $t = get 'http://www.alfrankensense.com/al_franken_quotes.html';

# inspect the web site and look at what "marker"
# your stuff usually starts, in your case - it's the tag:
# <center>Al Franken Quotes</center>

my @quotes; # array, where the quotes are to be collected

# *If* we got there:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression

# apply this expression to the text
while($t=~ /$q/g) { # /g in scalar context (look it up)
push @quotes, [$1, $2]; # save found quote on array
} # quote in $1, source in $2
}

print "total: " . scalar @quotes . " quotes found\n";

for my $q (@quotes) { # now show what quotes we found
print_nice($q->[0], $q->[1]) # and format them however you want
}
# thats it

# we need to provide our special formatting subroutine
sub print_nice {
my($q, $s) = @_; # shift actual arguments into variables
$q =~ s/\s+/ /g; # quote: transfer multiple whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" . '-'x40 ."\n" # new line + 40 x '-' + new line
. "- $s\n\n" # '-' + quote source + double \n
}

Regards

M.

Mirco Wahab · Dec 30, 2008

sorry for

It prevents the dot from matching across line
boundaries, as you surely know.

providing BS explanations. Seems like I'm getting old
and/or suffer from missing opportunities to practice
Perl Regex enough (as compared to Boost and PHP).

Thanks for your initial correction,

Mirco

Tad J McClellan · Dec 30, 2008

I expect you really meant "phrases" rather than "words" there?

That is, you want to match when "memory" follows "distributed",
with some whitespace in between?

If so, then:

my $what = qr'distributed.*?memory|Numerid.*?Ried|OpenMP';

Should instead be:

my $what = qr'distributed\s+memory|Numerid\s+Ried|OpenMP';

Now you will need neither m//m nor m//s...

Note that this will NOT match if the page instead contains

distributed&nbsp;memory
or
distributed<br>memory

etc ...
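One way around cases like these is to flatten the markup before testing for the phrase. The following is only a regex approximation, not an HTML parser, and flatten_html is a name invented for this sketch. Treating the &nbsp; entity as a space is an assumption about what the page might contain.

```perl
use strict;
use warnings;

# Reduce an HTML fragment to roughly the text a reader would see:
# tags become spaces, &nbsp; becomes a space, whitespace runs collapse.
sub flatten_html {
    my ($html) = @_;
    $html =~ s/<[^>]+>/ /g;
    $html =~ s/&nbsp;/ /g;
    $html =~ s/\s+/ /g;
    return $html;
}

for my $fragment ('distributed<br>memory', 'distributed&nbsp;memory') {
    my $text = flatten_html($fragment);
    my $verdict = $text =~ /distributed\s+memory/ ? "match" : "no match";
    print "$fragment => $verdict\n";
}
```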

I think this is a significant issue and is where I'm stumbling right now:

m//m modifies the meaning of the "^" and "$" anchors. It is useless
on patterns that do not make use of those anchors.

m//s modifies the meaning of ".". It is useless on patterns that
do not make use of the dot metacharacter (like in the amended
pattern above).
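A small demonstration of the two modifiers, on a made-up two-line string rather than the web page:

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ anchors only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;
# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# Without /s, '.' does not match a newline, so this cannot span lines.
my $no_s   = $text =~ /first.*second/  ? 1 : 0;
# With /s, '.' matches a newline as well.
my $with_s = $text =~ /first.*second/s ? 1 : 0;

print "plain: @plain\n";
print "multi: @multi\n";
print "no_s: $no_s, with_s: $with_s\n";
```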

my $url = 'www.alfrankensense.com/al_franken_quotes.html';

This is the page that I want to work on.

Getting this deep into a thread before the subject of the article
correlates with the Subject of the article is plain silly.

Please exercise more care in choosing the contents of your Subject header.

Hans Mulder · Dec 30, 2008

Dr.Ruud said:
Unless of course you would bind .pl to a shebang.com that would
just do what you once expected.

My copy of perlrun says that perl.exe does exactly that: it reads
the shebang line and redispatches to the path stated there (unless
the word "perl" occurs on that line).

When Perl was young, this was useful on Unix versions that didn't
support shebang; these days it is perhaps useful on Windows.

Hope this helps,

-- HansM

George · Dec 31, 2008

On Tue, 30 Dec 2008 12:17:44 +0100, Mirco Wahab wrote:

[snipped and reordered, for thematic reasons]

my hack on your problem:

use strict;
use warnings;
use LWP::Simple;

# load the complete content of the url in question
# via LWP::Simple::get(...)
my $t = get 'http://www.alfrankensense.com/al_franken_quotes.html';

# inspect the web site and look at what "marker"
# your stuff usually starts, in your case - it's the tag:
# <center>Al Franken Quotes</center>

my @quotes; # array, where the quotes are to be collected

# *If* we got there:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression

# apply this expression to the text
while($t=~ /$q/g) { # /g in scalar context (look it up)
push @quotes, [$1, $2]; # save found quote on array
} # quote in $1, source in $2
}

print "total: " . scalar @quotes . " quotes found\n";

for my $q (@quotes) { # now show what quotes we found
print_nice($q->[0], $q->[1]) # and format them however you want
}
# thats it

# we need to provide our special formatting subroutine
sub print_nice {
my($q, $s) = @_; # shift actual arguments into variables
$q =~ s/\s+/ /g; # quote: transfer multiple whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" . '-'x40 ."\n" # new line + 40 x '-' + new line
. "- $s\n\n" # '-' + quote source + double \n
}

We're getting real close here.

use strict;
use warnings;
use LWP::Simple;

# load the complete content of the url in question
# via LWP::Simple::get(...)
my $t = get 'http://www.alfrankensense.com/al_franken_quotes.html';

# inspect the web site and look at what "marker"
# your stuff usually starts, in your case - it's the tag:
# <center>Al Franken Quotes</center>

my @quotes; # array, where the quotes are to be collected

# *If* we got there:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression

# apply this expression to the text
while($t=~ /$q/g) { # /g in scalar context (look it up)
push @quotes, [$1, $2]; # save found quote on array
} # quote in $1, source in $2
}

print "total: " . scalar @quotes . " quotes found\n";

for my $q (@quotes) { # now show what quotes we found
print_nice($q->[0], $q->[1]) # and format them however you want
}
# thats it

# we need to provide our special formatting subroutine
sub print_nice {
my($q, $s) = @_; # shift actual arguments into variables
$q =~ s/\s+/ /g; # quote: transfer multiple whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" # new line
. "~~ $s\n" # '~~' + quote source + \n
. "% \n" # a percentage sign between quotes
}

# perl wahab7.pl >\Program Files\40tude Dialog\sigs\frank1.txt
# perl wahab7.pl >frank1.txt

The output is really close. I can't see any difference between the format
of the two following files

%
A dictatorship would be a heck of a lot easier, there's no question about
it.
George W. Bush
%
After the chaos and carnage of September 11th, it is not enough to serve
our enemies with legal papers.
George W. Bush
%
America is a friend to the people of Iraq. Our demands are directed only at
the regime that enslaves them and threatens us. When these demands are met,
the first and greatest benefit will come to Iraqi men, women and children.
George W. Bush
%
America is a Nation with a mission - and that mission comes from our most
basic beliefs. We have no desire to dominate, no ambitions of empire. Our
aim is a democratic peace - a peace founded upon the dignity and rights of
every man and woman.
George W. Bush

and

%
The biases the media has are much bigger than conservative or liberal.
They're about getting ratings, about making money, about doing stories that
are easy to cover.
~~ Al Franken,
%
[G. W. Bush's] pro-air pollution Clear Skies Initiative is designed to
clear the skies of birds.
~~ Al Franken,
%
And just like in 1984, where the enemy is switched from Eurasia to
Eastasia, Bush switched our enemy from al Qaeda to Iraq. Bush's War on
Terror is a war against whomever Bush wants to be at war with.
~~ Al Franken,
%
Mistakes are a part of being human. Appreciate your mistakes for what they
are: precious life lessons that can only be learned the hard way. Unless
it's a fatal mistake, which, at least, others can learn from.
~~ Al Franken,
%

, but right now, dialog doesn't think that these quotes are delimited by a
percentage sign. Also, we need to remove the commas after Al Franken, when
inappropriate. I think I can manage that tonight.

$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here

I don't understand what these statements do.

I looked into your html source and conceived something
that does the job somehow. Please try to learn some
basics of regular expressions, eg.: from here:
http://oreilly.com/catalog/9781565922570/

Thanks for your response, M., I'll read up tonight.

--
George

The terrorists and their supporters declared war on the United States - and
war is what they got.
George W. Bush

Picture of the Day http://apod.nasa.gov/apod/

Mirco Wahab · Dec 31, 2008

George said:
On Tue, 30 Dec 2008 12:17:44 +0100, Mirco Wahab wrote:
[snipped and reordered, for thematic reasons]

my hack on your problem:
if($t =~ /<center>Al Franken Quotes<\/center>/g) { # the inner / is escaped
# then we write a quick & dirty regular expression
# to map on the quote (look in the html for hints)
my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
([^<]+) # this has to be the quote source until next html tag
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression


BTW, there is an error in above expression. Replace everything
from
my $q = qr{
...
to
}sx

by this slight modification:

my $q = qr{ \t # the quote is always preceded by a tab
"([^"]+)" # find ", save all (saved to $1), to another "
.+? # fine, now look up a '-' followed by whitespace
\-\s+ # which comes here (escaped -) ..
(.+?) # this has to be the quote source ($2) until
</b> # <== the terminal html (closing) tag which is a </b>
}sx; # the /s lets the .(dot) match across lines
# the /x allows us to format and comment this expression

The alfranken-page sometimes has html within the quote source,
the <== part of the expression makes the difference.
(The stray commas will be gone too, this was the reason.)
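To see the difference between the two patterns without fetching the page, here is a sketch on a made-up fragment. The HTML below is hypothetical and only imitates the layout described above (tab, quoted text, dash, source, closing </b>); the real page may differ.

```perl
use strict;
use warnings;

# Hypothetical fragment: the source part contains an inline <i> tag.
my $html = qq{\t"A sample quote."<br><b>- Some Author, <i>Some Book</i></b>\n};

# Earlier pattern: the source capture stops at the first '<' of any tag.
my $old = qr{ \t "([^"]+)" .+? -\s+ ([^<]+)      }sx;
# Revised pattern: the source capture runs up to the closing </b>.
my $new = qr{ \t "([^"]+)" .+? -\s+ (.+?)   </b> }sx;

# A match in list context returns the captures; index 1 is the source.
my ($old_source) = ($html =~ $old)[1];
my ($new_source) = ($html =~ $new)[1];

print "old: [$old_source]\n";   # stops before <i>, leaving ", " behind
print "new: [$new_source]\n";   # keeps the whole source, tags included
```

The leftover tags in the second capture are what the s/<[^>]+>//g line in print_nice then removes.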

....

We're getting real close here.
...
sub print_nice {
my($q, $s) = @_; # shift actual arguments into variables
$q =~ s/\s+/ /g; # quote: transfer multiple whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "$q" # print quote, followed by ...
. "\n" # new line
. "~~ $s\n" # '-' + quote source + double \n
. "% \n" # a percentage sign between quotes
}
...
but right now, dialog doesn't think that these quotes are delimited by a
percentage sign. Also, we need to remove the commas after Al Franken, when
inappropriate. I think I can manage that tonight.

OK, your quotes should probably *start* with the %, so replace
the print_nice by this one:

sub print_nice {
my($q, $s) = @_; # shift actual arguments into variables
$q =~ s/\s+/ /g; # quote: transfer multiple whitespace to a single space
$q =~ s/<[^>]+>//g; # quote: remove html formatting
$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here
print "%\n$q\n" # print %, newline, quote, newline
. "~~ $s\n" # '~~' + quote source + \n
}

Furthermore, "remove" the line (further above)
# print "total: " . scalar @quotes . " quotes found\n";
by commenting out with '#'.

$s =~ s/\s+/ /g; # source: same here
$s =~ s/<[^>]+>//g; # source: same here

I don't understand what these statements do.

This is (line one) substitution (s/) of any count
of successive whitespace (\s+) by a single space ' '
and (second line) substitution of a html tag of
any kind like <stuff within brackets> by 'emptiness'.
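Applied to a made-up string (the value of $s below is only an illustration), the two statements behave like this:

```perl
use strict;
use warnings;

# Hypothetical raw source text with stray whitespace and an inline tag.
my $s = "Some  Author,\n\t<i>Some   Book</i>";

$s =~ s/\s+/ /g;       # each run of spaces, tabs, newlines -> one space
$s =~ s/<[^>]+>//g;    # each <...> tag -> nothing

print "$s\n";
```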

See: http://www.anaesthetist.com/mnm/perl/Findex.htm#regex.htm

Regards

M.
