Why can't I parse google search results?

bob

I'm trying to extract data from the results pages of search engines
using two modules, LWP::Simple and HTML::Parse, and the get function.

I can extract from yahoo and altavista but google is not cooperating.

I get this error message

Can't fetch HTML from http://www.google.com/search?q=smeghead at
parsing.pl line 13.



I'm obviously missing something, but I don't know what it is. Help would
be greatly appreciated. Thank you.
 
Alan J. Flavell

(e-mail address removed) (bob) wrote in


The error is on line 13.

Joking apart - some of the hardest-to-diagnose errors are those
where the error report is pointing somewhere else than the line which
is _really_ in error, due to some kind of knock-on effect.
How can we know if we don't see the code?

Let's not tempt the newbie to shovel their entire 600-line script onto
Usenet, though.

We _do_ need to see the code in some kind of appropriate context,
sure. The advice in the group's posting guidelines (as posted
regularly by Tad) would stand the questioner in good stead, if they
would only read it and at least give an impression that they're
following its advice.

Hint: the above line doesn't appear to be an error message coming
from Perl itself. Ergo, it's probably an error from some code written
in Perl. Look more closely at that code - work out whether it can
provide some additional diagnostics, and, if it can, then work out why
they aren't being displayed by the calling program. (the variable $!
may be of interest, for example).
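
One way to get more than a bare "Can't fetch" out of the failing line is to ask LWP::Simple for the HTTP status instead of only checking for undef. The following is a minimal sketch, not the poster's script: the output file name page.html and the helper name describe_status are made up for illustration, while getstore() (LWP::Simple) and is_success() and status_message() (HTTP::Status) are existing functions.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore);
use HTTP::Status qw(is_success status_message);

# Turn a numeric HTTP status into a readable diagnostic string.
# (describe_status is a hypothetical helper name.)
sub describe_status {
    my ($code) = @_;
    my $text = status_message($code);
    return defined $text ? "$code $text" : "$code (unknown status)";
}

# Run as: perl fetch_status.pl 'http://www.google.com/search?q=smeghead'
if (@ARGV) {
    my $url  = shift @ARGV;
    my $code = getstore($url, 'page.html');   # returns the HTTP status code
    is_success($code)
        or die "Can't fetch HTML from $url: ", describe_status($code), "\n";
    print "Saved $url to page.html\n";
}
```

Putting the URL in a variable also keeps the die message in step with the address that was actually requested.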
 
A. Sinan Unur

Joking apart - some of the hardest-to-diagnose errors are those
where the error report is pointing somewhere else than the line which
is _really_ in error, due to some kind of knock-on effect.

That is true.
Let's not tempt the newbie to shovel their entire 600-line script onto
Usenet, though.

Which is why I asked for the code, but you are absolutely right, I should
have pointed the OP either to the posting guidelines or explained how to
post source code.
We _do_ need to see the code in some kind of appropriate context,
sure. The advice in the group's posting guidelines (as posted
regularly by Tad) would stand the questioner in good stead, if they
would only read it and at least give an impression that they're
following its advice.

So, a plea to the OP: Please read the guidelines before posting source
code:

http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Doing so and following the recommendations therein will ensure you can
get the best help possible.

Sinan
 
Kevin Shay

(e-mail address removed) (bob) wrote in message
I'm trying to extract data from the results pages of search engines
using two modules, LWP::Simple and HTML::Parse, and the get function.

I can extract from yahoo and altavista but google is not cooperating.

I get this error message

Can't fetch HTML from http://www.google.com/search?q=smeghead at
parsing.pl line 13.

It appears Google won't give you a page unless you send a User-Agent
with the request, which LWP::Simple doesn't do. Try using LWP::UserAgent
instead.

http://www.perldoc.com/perl5.8.0/lib/LWP/UserAgent.html

Note that fetching Google results programmatically is most likely a
violation of Google's Terms of Service. Not that there would be any
consequences, but I thought I'd point this out. If you wanted to be
above-board about it, you could use the Google API:

http://www.google.com/apis/

--Kevin
 
Keith Keller

Note that fetching Google results programmatically is most likely a
violation of Google's Terms of Service.

I'm not so sure--the closest violation would be for an "offline"
search of Google, but since they don't define what that's supposed to
mean, I'd bet that running a script that performs a Google search would
be fine. Putting said script into a cron job might not be fine, but who
knows?
If you wanted to be
above-board about it, you could use the Google API:

http://www.google.com/apis/

The Google API has restrictions as well--IIRC you're limited to 100
searches a day. :)

--keith

--
(e-mail address removed)-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom

 
Gisle Aas

It appears Google won't give you a page unless you send a User-Agent
with the request, which LWP::Simple doesn't do.

This is not true. LWP::Simple does send a User-Agent header. Problem
here is that Google blocks requests with the default LWP User-Agent
header.
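
Replacing the default header value can be sketched with LWP::UserAgent as follows. This is an illustration under assumptions, not a tested recipe for Google: the agent string MySearchScript/0.1, the helper name make_agent, and the 30-second timeout are all made up, and whether any particular string is accepted by the server is not something this thread establishes.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent that identifies itself with its own string instead of
# LWP's default "libwww-perl/x.y". (make_agent is a hypothetical helper.)
sub make_agent {
    my ($agent_string) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    $ua->agent($agent_string);
    return $ua;
}

# Run as: perl fetch_ua.pl 'http://www.google.com/search?q=smeghead'
if (@ARGV) {
    my $url = shift @ARGV;
    my $ua  = make_agent('MySearchScript/0.1');   # made-up identifier
    my $res = $ua->get($url);
    $res->is_success
        or die "Can't fetch HTML from $url: ", $res->status_line, "\n";
    print $res->decoded_content;
}
```

Unlike LWP::Simple's get(), the response object carries a status line, so a refusal by the server shows up in the error message.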
 
pkent

I'm trying to extract data from the results pages of search engines
using two modules, LWP::Simple and HTML::Parse, and the get function.

I can extract from yahoo and altavista but google is not cooperating.

Google has an API that is designed for machines to use and can happily
be used from Perl:
http://www.google.com/apis/
Remember to read and comply with the terms of service, etc :)
I get this error message

Can't fetch HTML from http://www.google.com/search?q=smeghead at
parsing.pl line 13.

I would bet, but I don't know for sure, that Google rejects queries that
come in with certain User-Agent headers, and LWP::UserAgent has a default
string.

P
 
bob

My code is short. I didn't post my code because I was neglectful and
I thought I supplied enough information for the gist of my question.


use LWP::Simple;
use HTML::Parse;
use HTML::FormatText;
$html = get("http://www.google.com/search?q=smeghead");
defined $html or die "Can't fetch HTML from http://www.perl.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;

As I mentioned earlier, this works for yahoo but not google. Since it
works with yahoo I don't believe there is a problem with the code, but
with google. Or is there a problem with the code?

I thank those of you who suggested the google apis. If google is
blocking my requests with the default LWP User-Agent header, then
obviously I have to make some changes.
Thank you.
 
A. Sinan Unur

(e-mail address removed) (bob) wrote in @posting.google.com:
My code is short. I didn't post my code because I was neglectful and
I thought I supplied enough information for the gist of my question.

And others have already answered your question.
use LWP::Simple;
use HTML::Parse;
use HTML::FormatText;
$html = get("http://www.google.com/search?q=smeghead");
defined $html or die "Can't fetch HTML from http://www.perl.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;

As I mentioned earlier, this works for yahoo but not google. Since it
works with yahoo I don't believe there is a problem with the code, but
with google. Or is there a problem with the code?

Try

lwp-request http://www.google.com/search?q=smeghead > t.html

on the command line and view the file in your browser.
 
Iain Chalmers

My code is short. I didn't post my code because I was neglectful and
I thought I supplied enough information for the gist of my question.


use LWP::Simple;
use HTML::Parse;
use HTML::FormatText;
$html = get("http://www.google.com/search?q=smeghead");
defined $html or die "Can't fetch HTML from http://www.perl.com/";
$ascii = HTML::FormatText->new->format(parse_html($html));
print $ascii;

As I mentioned earlier, this works for yahoo but not google. Since it
works with yahoo I don't believe there is a problem with the code, but
with google. Or is there a problem with the code?

I thank those of you who suggested the google apis. If google is
blocking my requests with the default LWP User-Agent header, then
obviously I have to make some changes.

You can't parse it because you're not getting it. You don't have a
parsing problem - try printing $html - google isn't sending you anything.

And yes, it _is_ because of your UserAgent, and as others have pointed
out, using the google api is the way google would like you to solve your
problem.
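
For a script that stays on LWP::Simple, the module exports its internal user agent object as $ua when asked, so the agent string can be changed without giving up get(). A minimal sketch, assuming a made-up agent string (MySearchScript/0.1); it only reports how much was received, which is the "try printing $html" check in its simplest form.

```perl
use strict;
use warnings;
use LWP::Simple qw($ua get);

# $ua is the LWP::UserAgent object that LWP::Simple uses internally;
# it is exported only when requested explicitly, as above.
$ua->agent('MySearchScript/0.1');   # made-up identifier

# Run as: perl fetch_simple.pl 'http://www.google.com/search?q=smeghead'
if (@ARGV) {
    my $url  = shift @ARGV;
    my $html = get($url);
    defined $html or die "Can't fetch HTML from $url\n";
    print length($html), " bytes received\n";
}
```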

big
 
