Pattern matching problem.

demolitionz

I'm working on a project to try and write a program in perl which will
connect to google, search for a specified keyword and return the URLs
found. My problem is that I can only get the program to return the
first URL found, and despite spending a good few hours playing around
with it and searching the web for answers, I can't seem to solve the
problem. Here's the code I've written so far...

#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;

# Require a non-empty keyword on the command line.
if (!@ARGV or $ARGV[0] eq '') {
    print "Script called incorrectly.\nFormat: google.pl keyword\n";
    exit;
}

my $browser  = LWP::UserAgent->new;
my $response = $browser->get("http://www.google.com/search?q=$ARGV[0]");
if ($response->is_success) {
    if ($response->content =~ m{<font color=#008000>(.*?)</font>}i)
    {
        print "$1\n";
    }
    else { print "Could not connect"; }
}
exit;

Now I personally assumed the solution would have been as easy as
changing $1 to $2 to get the second URL, but it doesn't seem so. That
being the case I assume this script will need a total rework, but have
no idea where to even begin. Can anyone help?
 
Fabian Pilkowski

demolitionz said:
I'm working on a project to try and write a program in perl which will
connect to google, search for a specified keyword and return the URLs
found.

Nice. Google provides an API for this, so you don't have to parse any
web page; you just request the data you want. Have a look at

http://www.google.com/apis/

to create the free Google account you need in order to use it. To use
Google's API from Perl, look on CPAN. I suggest starting with the module

WWW::Search::Google

If that doesn't do everything you want, take a look at Net::Google.
That module is the basis for the first one and lets you do more
specific things.

demolitionz said:
My problem is that I can only get the program to return the
first URL found, and despite spending a good few hours playing around
with it and searching the web for answers, I can't seem to solve the
problem. Here's the code I've written so far...
[snip]
if ($response->content =~ m{<font color=#008000>(.*?)</font>}i)
{
    print "$1\n";
}
else { print "Could not connect"; }
[snip]
Now I personally assumed the solution would have been as easy as
changing $1 to $2 to get the second URL, but it doesn't seem so. That
being the case I assume this script will need a total rework, but have
no idea where to even begin. Can anyone help?

To match more than once you could use a loop. Untested:

while ( $response->content =~ m{<font color=#008000>(.*?)</font>}ig ) {
    print "$1\n";
}

Note the g modifier at the end of the regex.

regards,
fabian
 
demolitionz

Klaus said:
while ($response->content =~ m{<font color=#008000>(.*?)</font>}ig)
{
    print "$1\n";
}

Thanks for the responses. I gave the quoted bit of code a go but
unfortunately it just went into an infinite loop repeating the first
result. I also tried replacing "while" with "foreach", but that didn't
work either. I've been playing around with the original idea some more
and have finally got it to work through a very messy bit of code.
Unfortunately, due to Google's formatting, I get <b> HTML tags in the
middle of some of my results, but that can be ironed out later. I've
attached the working code below in case people are interested or want
to make it a bit less messy ;) PS: this code was written in haste and
ever-increasing frustration, so the names of the arrays etc. are rather
random!

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Require a non-empty keyword on the command line.
if (!@ARGV or $ARGV[0] eq '') {
    print "Script called incorrectly.\nFormat: google.pl keyword\n";
    exit;
}

my $browser  = LWP::UserAgent->new;
my $response = $browser->get("http://www.google.com/search?q=$ARGV[0]");
if ($response->is_success) {
    my $content = $response->content;
    my @broken = split("<br>", $content);        # one chunk per <br>-separated piece
    my $searchterm = "<font color=#008000>(.*?)</font>";
    my @found = grep(/$searchterm/i, @broken);   # chunks that contain a green result line
    foreach (@found) {
        if ($_ =~ m{<font color=#008000>(.*?)</font>}i) {
            my ($first) = split(' ', $1);        # first whitespace-separated field
            print "$first\n";
        }
    }
}
exit;
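
For the stray <b> tags, one possible cleanup is to delete anything that looks like a tag before taking the first whitespace-separated field. This is only a rough sketch: clean_result is an illustrative name, the sample string is made up, and a crude substitution like this is no substitute for a real HTML parser.

```perl
use strict;
use warnings;

# clean_result is a hypothetical helper, not part of LWP or any module.
sub clean_result {
    my ($captured) = @_;
    (my $plain = $captured) =~ s/<[^>]*>//g;   # drop <b>, </b> and similar tags
    my ($url) = split ' ', $plain;             # keep only the first field
    return $url;
}

# Made-up example of what $1 might contain for one result.
print clean_result('www.example.com/<b>perl</b>/faq.html - 12k - Cached'), "\n";
```

In the script above, the call would take the place of the split on $1 inside the foreach loop.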
 
Klaus Eichner

demolitionz said:
[snip]
My problem is that I can only get the program to return the
first URL found
[snip]
if ($response->content =~ m{<font color=#008000>(.*?)</font>}i)
{
    print "$1\n";
}
[snip]
Now I personally assumed the solution would have been as easy as
changing $1 to $2 to get the second URL, but it doesn't seem so.

No need to change $1. Just add the g option at the end of the regular
expression "m{...}i" and make it a while loop rather than a simple if. That
should do the trick (see also "perldoc perlop", section "Regexp
Quote-Like Operators").

while ($response->content =~ m{<font color=#008000>(.*?)</font>}ig)
{
    print "$1\n";
}
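
A related option, offered only as a sketch: in list context the same m{...}g match returns every captured group in one go, so the results can be collected into an array without an explicit loop. The sample string and the helper name extract_green below are made up for illustration and stand in for the real page content.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the downloaded page.
my $html = join '',
    "<font color=#008000>www.example.com/ - 10k</font><br>",
    "<FONT color=#008000>www.example.org/faq - 4k</FONT><br>";

# extract_green is an illustrative name, not part of any module.
# Called in list context, the match returns all group-1 captures at once.
sub extract_green {
    my ($text) = @_;
    return $text =~ m{<font color=#008000>(.*?)</font>}ig;
}

print "$_\n" for extract_green($html);
```

Assigning the result to an array (my @urls = extract_green($html);) keeps the matches around for later processing.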
 
Eric Amick

demolitionz said:
Thanks for the responses. I gave the quoted bit of code a go but
unfortunately it just went into an infinite loop repeating the first
result.

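$response->content is a method call, so every pass through the loop
matches against a freshly returned string. A //g match in scalar context
resumes from the position stored with the string it last matched (see
"perldoc -f pos"), which only works when the same variable is matched on
every pass. Try

my $content = $response->content;
while ($content =~ m{<font color=#008000>(.*?)</font>}ig)
{
    print "$1\n";
}

instead.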
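
The difference can be reproduced without any network access. In the sketch below, FakeResponse is a made-up stand-in for the real response object (only a content method is modelled), and the $limit argument exists purely to cut the runaway loop short.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an HTTP response; not the real LWP class.
package FakeResponse;
sub new     { my ($class, $body) = @_; return bless { body => $body }, $class }
sub content { my ($self) = @_; return $self->{body} }   # caller gets a copy each time

package main;

my $response = FakeResponse->new('<i>a</i><i>b</i><i>c</i>');

# Matching against the method call: each pass sees a new string with no
# stored match position, so the search starts from the beginning again.
sub via_method_call {
    my ($limit) = @_;
    my @seen;
    while ($response->content =~ m{<i>(.*?)</i>}g) {
        push @seen, $1;
        last if @seen >= $limit;    # safety stop for the demonstration
    }
    return @seen;
}

# Matching against a variable: the position is kept between passes.
sub via_variable {
    my $content = $response->content;
    my @seen;
    while ($content =~ m{<i>(.*?)</i>}g) {
        push @seen, $1;
    }
    return @seen;
}

print "method call: @{[ via_method_call(4) ]}\n";
print "variable:    @{[ via_variable() ]}\n";
```

The first sub keeps collecting the first item until the limit stops it, while the second walks through all three items and then ends on its own.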
 
Klaus Eichner

demolitionz said:
Thanks for the responses. I gave the quoted bit of code a go but
unfortunately it just went into an infinite loop repeating the first
result.

I don't think that the "while (...m{...}ig)" is directly responsible for the
infinite loop.

Here is a small, but complete example to demonstrate the principle of "while
(...m{...}ig)":
============================================
use strict;
use warnings;

my $resp = q{
<html>
<body bgcolor="#ffffff">
<title>xxx</title>
<font color=#008000>item 1</font><br>
<font color=#008000>item 2</font><br>
<font color=#008000>item 3</font><br>
<font color=#008000>item 4</font><br>
</body>
</html>
};

while ($resp =~ m{<font color=#008000>(.*?)</font>}ig)
{
    print "$1\n";
}
============================================

The output of that program is:
======================
item 1
item 2
item 3
item 4
======================

demolitionz said:
...have finally got it to work.

I am happy that you finally succeeded.

demolitionz said:
[snip]
my $response =
    $browser->get("http://www.google.com/search?q=$ARGV[0]");
if ($response->is_success) {
    my $content = $response->content;
    my @broken = split("<br>",$content);
    my $searchterm = "<font color=#008000>(.*?)</font>";
    my @found = grep(/$searchterm/i, @broken);
    foreach (@found) {
        if ($_ =~ m{<font color=#008000>(.*?)</font>}ig) {
            @_ = split(' ',$1); print "$_[0]\n";
        }
    }
}
exit;
 
