Searching Google in Java

mfasoccer

I'm working on a project that involves searching with Google. I have
been getting an HTTP 403 error with the following code:

import java.net.*;
import java.io.*;

public class GoogleSearchTest
{
    public static void main(String[] args) throws Exception {
        URL hp = new URL("http://www.google.com/search?q=babelfish");
        URLConnection hpCon = hp.openConnection();
        hpCon.connect();
        InputStream input = hpCon.getInputStream(); // error traces to here

        /*
        This code is all irrelevant to my problem because
        the input stream is refused
        String content = "";
        int c;
        while((c = input.read()) != -1)
            content += (char)c;
        */
    }
}

I know that an HTTP 403 error means that the server understood the
request, yet refused it. As you can probably tell, I have very little
network programming experience, so maybe more experienced programmers
could help alter my approach, or explain a better one? Thanks.
 
Patricia Shanahan

I'm working on a project that involves searching with Google. I have
been getting an HTTP 403 error with the following code:
....

Google offers a Java API, see http://www.google.com/apis/. It is much
easier than trying to get and parse a web page.

Note that they limit automated searching to 1,000 queries per day for
non-commercial use, and require a license key with each request.

Patricia
 
alexandre_paterson

I know that an HTTP 403 error means that the server understood the
request, yet refused it. As you can probably tell, I have very little
network programming experience, so maybe more experienced programmers
could help alter my approach, or explain a better one? Thanks.

A better approach would be to use Google's APIs, as Patricia pointed
out.

However, this is not always an option (the API didn't help
for, e.g., groups.google.com last time I checked, but this was
a long time ago, I admit).

Faking your user agent string will allow you to bypass the 403
(though it would probably be a breach of Google's terms).
 
mfasoccer

Faking your user agent string will allow you to bypass the 403

Could anyone provide a sample of how to fake my user agent string?
 
alexandre_paterson

In your example, you insert one line:

URLConnection hpCon = hp.openConnection();
hpCon.setRequestProperty("User-Agent",
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
hpCon.connect();

and that may work.

But you still should respect Google's terms...
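
For anyone who wants the whole thing in one piece, here is a minimal sketch of the same idea. The class name UserAgentExample and the helper openWithUserAgent are made up for illustration; the User-Agent value is the one from the snippet above. It also uses HttpURLConnection.getResponseCode(), which reports a 403 as a plain status code instead of failing in getInputStream(). It only shows the mechanics and does not make automated querying acceptable under Google's terms.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Illustrative sketch only; the class and method names are hypothetical.
class UserAgentExample {
    // Browser-like value copied from the snippet above.
    static final String BROWSER_USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511";

    // Prepares a connection with the given User-Agent header.
    // Nothing is sent over the network until the caller connects or reads.
    static URLConnection openWithUserAgent(String address, String userAgent)
            throws IOException {
        URLConnection con = URI.create(address).toURL().openConnection();
        con.setRequestProperty("User-Agent", userAgent);
        return con;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java UserAgentExample <url>");
            return;
        }
        URLConnection con = openWithUserAgent(args[0], BROWSER_USER_AGENT);
        if (con instanceof HttpURLConnection) {
            HttpURLConnection http = (HttpURLConnection) con;
            // getResponseCode() sends the request and returns the status,
            // including 403, without throwing for HTTP error statuses.
            System.out.println("HTTP status: " + http.getResponseCode());
            http.disconnect();
        }
    }
}
```

The header has to be set before connect() or getInputStream() is called; once the connection is open, setRequestProperty throws IllegalStateException.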
 
mfasoccer

URLConnection hpCon = hp.openConnection();
hpCon.setRequestProperty("User-Agent",
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.8) Gecko/20050511");
hpCon.connect();

It works, thanks.
 
mfasoccer

But you'll still get the same restriction of 1000 hits per day however you

Does this mean that even regular searches that are executed through
their website with an actual browser are also limited to 1000 hits per
day?
 
Thomas Weidenfeller

Does this mean that even regular searches that are executed through
their website with an actual browser are also limited to 1000 hits per
day?

You are not doing a regular search via a browser. You are trying to do
some automated querying. Google's ToS prohibits this; see
http://www.google.com/terms_of_service.html. Whatever you are trying to
do, your idea is flawed, since it is based on the concept of violating
the terms of service of the service you are using.

And do you really think you are the first one who had the glorious idea
to "work around" the API limitation (read: violate the ToS) by
simulating a browser?

The irony is that you even use a Google mail address to plan and
announce your intended violation of Google's ToS in public. What a great
idea.
 
jeremiah johnson

Does this mean that even regular searches that are executed through
their website with an actual browser are also limited to 1000 hits per
day?

Google is *extremely* good at detecting automated queries. Just get
your program working, query Google a few hundred times, then try to
visit google.com in your browser. You will very likely see a message
that they have detected you.

Someone at my employer tried this the other day. A few hundred
automated queries later and the entire Fortune 50 company had to go
through a CAPTCHA each time we wanted to use Google. 180,000+ people.
 
Bent C Dalager

Someone at my employer tried this the other day. A few hundred
automated queries later and the entire Fortune 50 company had to go
through a CAPTCHA each time we wanted to use Google. 180,000+ people.

How good are their CAPTCHAs? Is there a way to see them without first
getting oneself banned?

Cheers
Bent D
 
IchBin

ashesh said:
hi!! have any one have idea about Hibernet,if u do then plz tell me
about this.
Do a google search on hibernet java

then look at the first article.

Thanks in Advance...
IchBin, Pocono Lake, Pa, USA
http://weconsultants.servebeer.com/JHackerAppManager
__________________________________________________________________________

'If there is one, Knowledge is the "Fountain of Youth"'
-William E. Taylor, Regular Guy (1952-)
 
Roedy Green

I know that an HTTP 403 error means that the server understood the
request, yet refused it.

Here is what I would do. I don't know if this is the problem though.

Use a sniffer to watch the same query given by a browser. See
http://mindprod.com/jgloss/sniffer.html

Pad your request header out with additional fields the browser sends,
e.g. info on what encodings are acceptable in reply.
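
A rough sketch of what padding the header could look like in Java. The class BrowserHeaders and the header values are assumptions made up for illustration; in practice you would copy the exact fields your sniffer shows for a real browser request.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names and header values are hypothetical.
class BrowserHeaders {
    // Example fields of the kind a browser sends. Replace them with
    // whatever the sniffer actually captured.
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "en-US,en;q=0.5");
        // Asking for gzip means the caller must wrap the response stream
        // in a GZIPInputStream when the server replies with gzip.
        headers.put("Accept-Encoding", "gzip");
        return headers;
    }

    // Copies each header onto the connection. Must be called before connecting.
    static void apply(URLConnection con, Map<String, String> headers) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            con.setRequestProperty(e.getKey(), e.getValue());
        }
    }

    // Creates an unconnected URLConnection carrying the default headers.
    static URLConnection prepare(String address) throws IOException {
        URLConnection con = URI.create(address).toURL().openConnection();
        apply(con, defaults());
        return con;
    }
}
```

Whether extra fields change the server's answer is something only an experiment against the sniffer trace can tell; the code just shows where they go.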
 
Roedy Green

How good are their CAPTCHAs? Is there a way to see them without first
getting oneself banned?

It would not take too much cleverness. All they have to do is monitor
hits per hour from a given IP. If it suddenly jumps up, and if the
hits have a stereotyped rigidity of format and timing, they have you
nailed.
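
To make the hits-per-hour idea concrete, here is a small sketch of a per-IP sliding-window counter. This is purely an illustration of the heuristic described above, not anything known about how Google actually does it; the class HitRateMonitor, its threshold, and its window length are invented for the example. It covers only the hit-rate part, not the format or timing analysis.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not a description of any real service.
class HitRateMonitor {
    private final long windowMillis;
    private final int maxHits;
    // Timestamps of recent hits, oldest first, keyed by client IP.
    private final Map<String, Deque<Long>> hitsByIp = new HashMap<>();

    HitRateMonitor(long windowMillis, int maxHits) {
        this.windowMillis = windowMillis;
        this.maxHits = maxHits;
    }

    // Records one hit and returns true when this IP has made more than
    // maxHits requests within the last windowMillis milliseconds.
    boolean recordHit(String ip, long nowMillis) {
        Deque<Long> hits = hitsByIp.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Forget hits that have fallen out of the window.
        while (!hits.isEmpty() && nowMillis - hits.peekFirst() >= windowMillis) {
            hits.pollFirst();
        }
        hits.addLast(nowMillis);
        return hits.size() > maxHits;
    }
}
```

With a one-hour window and a limit of a few hundred, a script firing queries in a tight loop would trip recordHit within seconds, while a person typing searches into a browser would stay well under it.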
 
