Automating Searches

Discussion in 'Java' started by nowwho@gmail.com, Jan 3, 2007.

  1. Guest

    Hey,

    New to Java! Trying to automate searching Google, Yahoo, MSN, AOL
    and Ask by sending queries to those engines using a Java program and
    storing the returned URLs in a MySQL database. The program will open a
    text file, read the first line as the query, connect to each of the
    search engines, and store the URLs in a table called "Results_Table" which
    has the following columns:

    Search_Eng - This would record the search engine name
    Query - This would record the query text
    Returned_URL - This is the URL that the search engine returned
    URL_Num - This is the position of the URL in that search engine's
    results.

    Is it possible to do this and store the first 100 URLs the query
    returns from each search engine?
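
    For the storage side, this is roughly what I picture - just an
    untested sketch, assuming the MySQL Connector/J driver and made-up
    connection details:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;

        public class ResultsStore {
            public static void main(String[] args) throws Exception {
                // Load the MySQL Connector/J driver.
                Class.forName("com.mysql.jdbc.Driver");
                Connection con = DriverManager.getConnection(
                        "jdbc:mysql://localhost/searches", "user", "password");
                PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO Results_Table"
                        + " (Search_Eng, Query, Returned_URL, URL_Num)"
                        + " VALUES (?, ?, ?, ?)");
                ps.setString(1, "Google");                  // Search_Eng
                ps.setString(2, "example query");           // Query
                ps.setString(3, "http://www.example.com/"); // Returned_URL
                ps.setInt(4, 1);                            // URL_Num (position)
                ps.executeUpdate();
                ps.close();
                con.close();
            }
        }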

    Thanks!
    , Jan 3, 2007
    #1

  2. Andrew Thompson Guest

    wrote:
    ....
    > New to Java! Trying to automate searching Google,


    See the Google search API, but be prepared to pay
    for anything beyond the nominal number of queries
    the Google API permits for free.

    >...Yahoo, MSN, AOL and Ask ...


    Dunno.. Aren't most of them using data from
    Google, in any case?

    >...by sending queries to those engines using a Java program and
    > storing the returned URLs in a MySQL database.


    Why would your users prefer to query your DB rather
    than query Google directly (for up-to-the-moment data)?

    >...The program will open a
    > text file, read the first line as the query, connect to each of the
    > search engines,

    .....
    > Is it possible to do this and store the first 100 URLs the query
    > returns from each search engine?


    Certainly - through whatever public API the search
    engine offers - talk to their tech departments and
    they'll most probably instruct you how to get the
    data as XML (or something else as conveniently
    portable and easily parsable).
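
    For instance, fetching and parsing such an XML response might look
    roughly like this - an untested sketch where the endpoint URL and
    element name are invented, so check the engine's documentation for
    the real ones:

        import java.io.InputStream;
        import java.net.URL;
        import java.net.URLEncoder;
        import javax.xml.parsers.DocumentBuilderFactory;
        import org.w3c.dom.Document;
        import org.w3c.dom.NodeList;

        public class XmlResults {
            public static void main(String[] args) throws Exception {
                // Invented endpoint; substitute the engine's real API URL.
                String q = URLEncoder.encode("java tutorial", "UTF-8");
                URL url = new URL("http://api.example-engine.com/search?q=" + q);
                InputStream in = url.openStream();
                Document doc = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(in);
                // Invented element name; depends on the engine's schema.
                NodeList urls = doc.getElementsByTagName("url");
                for (int i = 0; i < urls.getLength(); i++) {
                    System.out.println(urls.item(i).getTextContent());
                }
                in.close();
            }
        }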

    Andrew T.
    Andrew Thompson, Jan 3, 2007
    #2

  3. Daniel Pitts Guest

    Andrew Thompson wrote:
    ....
    > Certainly - through whatever public API the search
    > engine offers - talk to their tech departments and
    > they'll most probably instruct you how to get the
    > data as XML (or something else as conveniently
    > portable and easily parsable).


    Also, make sure you read the terms of use for all those services.

    Although, I do wonder why you would want to store search results in a
    database. It's not that hard to make a data scraper and just use the
    website directly. But Google DOES give you an API to do it more easily.
    Daniel Pitts, Jan 3, 2007
    #3
  4. "Andrew Thompson" <> wrote in message
    news:...
    >>...Yahoo, MSN, AOL and Ask ...

    >
    > Dunno.. Aren't most of them using data from
    > Google, in any case?


    Um . . . Certainly Yahoo and MSN are not.

    --
    LTP

    :)
    Luc The Perverse, Jan 4, 2007
    #4
  5. Luc The Perverse wrote:
    > "Andrew Thompson" <> wrote in message
    > news:...
    > >>...Yahoo, MSN, AOL and Ask ...

    > >
    > > Dunno.. Aren't most of them using data from
    > > Google, in any case?

    >
    > Um . . . Certainly Yahoo and MSN are not.


    OK - I see lots of hits for MSN bots in my server logs,
    but not one for Yahoo. What does its bot identify itself
    as?

    Andrew T.
    Andrew Thompson, Jan 4, 2007
    #5
  6. Andrew Thompson wrote:
    > Luc The Perverse wrote:
    >> "Andrew Thompson" <> wrote in message
    >> news:...
    >>>> ...Yahoo, MSN, AOL and Ask ...
    >>> Dunno.. Aren't most of them using data from
    >>> Google, in any case?

    >> Um . . . Certainly Yahoo and MSN are not.

    >
    > OK - I see lots of hits for MSN bots in my server logs,
    > but not one for Yahoo. What does its bot identify itself
    > as?
    >
    > Andrew T.
    >

    Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)
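
    If you want totals per bot, a quick untested sketch along these
    lines will tally them from an access log (msnbot and Googlebot
    being the usual tokens for the other two):

        import java.io.BufferedReader;
        import java.io.FileReader;

        public class BotCount {
            public static void main(String[] args) throws Exception {
                String[] bots = { "Yahoo! Slurp", "msnbot", "Googlebot" };
                int[] counts = new int[bots.length];
                BufferedReader r = new BufferedReader(new FileReader("access.log"));
                String line;
                while ((line = r.readLine()) != null) {
                    for (int i = 0; i < bots.length; i++) {
                        if (line.indexOf(bots[i]) >= 0) counts[i]++;
                    }
                }
                r.close();
                for (int i = 0; i < bots.length; i++) {
                    System.out.println(bots[i] + " - " + counts[i]);
                }
            }
        }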

    --
    TechBookReport Java - http://www.techbookreport.com/JavaIndex.html
    TechBookReport, Jan 4, 2007
    #6
  7. TechBookReport wrote:
    > Andrew Thompson wrote:

    ...
    > > ..I see lots of hits for MSN bots in my server logs,
    > > but not one for Yahoo.

    ....
    > Look for: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/)


    OK - I see them now..

    Yahoo! - 9246
    msn - 21457
    goog - 7638

    I was surprised I did not find them on the first search..
    Must have been something stupid I did.. (shrugs)

    BTW - nice to see you 'about the place' again..
    I think of you whenever somebody asks after books,
    but a quick, very tentative, search failed to lay a URL
    on your site. I'll bookmark it.

    Andrew T.
    Andrew Thompson, Jan 4, 2007
    #7
  8. Daniel Pitts wrote:
    > Although, I do wonder why you would want to store search results in a
    > database. It's not that hard to make a data scraper and just use the
    > website directly. But Google DOES give you an API to do it more easily.


    Yeah, but using that API (at least, using it very much) is expensive. By
    scraping the results after submitting a normal query URL, a) not diving
    too deeply into the results, and b) not doing new queries too often, you
    can probably fly under the radar, and unless you're coming from a
    datacenter somewhere they won't know you from Adam doing manual searches
    in Firefox.

    To top it off, Java makes transparently caching pages (and, as of 1.6,
    handling cookies) easier too. Add in a deliberate request of the
    front page before doing the search query, some random delays, and a
    spoofed user-agent, and I'm guessing the only way Google could figure
    out you weren't just a surfer using Mozilla 4.0 (compatible; MSIE 4.0)
    would be by using a tool like EtherSniffer to analyze your incoming
    requests and discovering that Java sends the HTTP headers in an
    idiosyncratic sequence. And they won't do that unless your IP generates
    an eyebrow-raising amount of traffic.
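
    The cookie and header parts, at least, are only a few lines each.
    A minimal untested sketch using the standard java.net classes (the
    CookieManager is the 1.6 addition mentioned above):

        import java.net.CookieHandler;
        import java.net.CookieManager;
        import java.net.HttpURLConnection;
        import java.net.URL;

        public class BrowserishFetch {
            public static void main(String[] args) throws Exception {
                // Java 1.6: install a process-wide, in-memory cookie store.
                CookieHandler.setDefault(new CookieManager());

                URL url = new URL("http://www.google.com/");
                HttpURLConnection con = (HttpURLConnection) url.openConnection();
                // Replace the default "Java/1.x" User-Agent string.
                con.setRequestProperty("User-Agent",
                        "Mozilla/4.0 (compatible; MSIE 4.0)");
                System.out.println("HTTP " + con.getResponseCode());
                con.disconnect();

                // A random delay before the next request.
                Thread.sleep(1000 + (long) (Math.random() * 4000));
            }
        }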

    And for Google that "eyebrow-raising" threshold is set very high indeed;
    "normal" traffic for Google is millions of searches per day and there
    are frequently dozens per day from each of many individual IP addresses
    as well as untold numbers of one-offs and the like.

    And, of course, as long as you don't generate more traffic faster than
    you could by typing in all those queries manually, I don't see any moral
    qualms with this. At worst it's equivalent to adblocking the sponsored
    links on the results page with a commonly-available Firefox extension.
    All you've done is automate some tedium at your end without having any
    discernible effect at theirs versus not automating the tedium. So unless
    you do believe in victimless crimes or don't believe in the identity of
    indiscernibles ... :)
    John Ersatznom, Jan 4, 2007
    #8
  9. John Ersatznom wrote:
    ....
    > And, of course, as long as you don't generate more traffic faster than
    > you could by typing in all those queries manually, I don't see any moral
    > qualms with this.


    One might also argue that you were free to build
    your own web-crawler, parse the pages it finds
    for the content and links*, store the data in searchable
    form, then rate and rank it according to whatever
    criteria best suit you[1]. * Oh, of course, then
    'repeat for each link', & repeat every 7(?) days.

    Setting up the software and hardware capable of
    achieving that task might cost a lot of money (I
    guess). OTOH, you can pay a fee to someone
    who has already gone to the effort, and has the
    expertise.

    Just because it is technically possible** to rip
    Google off, does not make it right.

    ** + all the other idiotic reasons people generally
    put forward to justify such theft, starting with..
    - 'they don't have a right - it is free data!'. No it isn't -
    the web pages themselves are free, but the search
    engines hope to add value by sorting and filtering.

    Also, Google is no 'monopoly'. As has been pointed
    out in this (AFAIR) thread. You don't like Google's
    prices? Go to the competition..

    [1] And then, can you make it publicly available,
    so I can rip your data, and resell it to my paying
    clients?

    Andrew T.
    Andrew Thompson, Jan 4, 2007
    #9
  10. nowwho Guest

    Hey,
    Thanks for the information so far. I didn't realise there was so much
    legal stuff involved; it's for a once-off educational project. Didn't
    think it would amount to spamming. The program would only be run about
    50 times in total. There is a set number of queries, and a set number
    of results returned. As it's an educational project I never thought of
    the legal side!
    nowwho, Jan 4, 2007
    #10
  11. Chris Uppal Guest

    John Ersatznom wrote:

    > Add in a deliberate request of the
    > front page before doing the search query, some random delays, and a
    > spoofed user-agent, and I'm guessing the only way Google could figure
    > out you weren't just a surfer using Mozilla 4.0 (compatible; MSIE 4.0)
    > would be by using a tool like EtherSniffer to analyze your incoming
    > requests and discovering that Java sends the HTTP headers in an
    > idiosyncratic sequence. And they won't do that unless your IP generates
    > an eyebrow-raising amount of traffic.


    Google can and does apply more intelligence than that.

    The simplest thing to look for is the originating IP address of the request (at
    the TCP/IP level). A suspicious pattern of requests from one IP (e.g. too many
    in one time period), and Google will stop serving queries from that IP address.
    (The originating IP /can/ be spoofed, but not many Java programmers will
    have the necessary skills, and in any case it is hardly worth the effort.) That
    criterion can also give false positives; for instance, if an organisation is
    working behind a NAT and one person from that organisation is detected
    abusing Google's services, the entire organisation will be blocked. Does
    Google care? Why should it?

    Then, too, Google has available /all/ the data which enters its data-centres;
    from low-level fingerprinting of IP packets, up through checking HTTP headers,
    extending all the way to historical and cross-site access patterns (I would be
    very surprised if they didn't use a custom TCP/IP stack implementation for
    their HTTP servers). How much of that information it actually uses (or even
    collects) I don't know -- but I'd guess that it collects most of it, and uses
    as much as it feels it has to in order to prevent abuse.

    And they do actively work to prevent abuse. There are many kinds of possible
    abuse, and I imagine Google work to prevent most of them, but I doubt if there
    are many things they dislike more than people attempting to steal their data.

    -- chris
    Chris Uppal, Jan 4, 2007
    #11
  12. nowwho wrote:
    > Hey,
    > Thanks for the information so far. I didn't realise there was so much
    > legal stuff involved; it's for a once-off educational project.


    You 'ivory tower' types are *so* naive. It's cute. ;-)

    >...Didn't
    > think it would amount to spamming.


    I am not sure I would use that term for it.

    Spamming is generally pushing an advertising
    related message out to people who do not want it.

    This (when done the 'wrong way') simply amounts
    to a bit of theft of the resources of others.

    & for my part, while I might hassle the thieves,
    I'll bludgeon the spammers.

    >...The program would only be run about
    > 50 times in total.


    I think you might be well placed to use the 'legal
    and free' APIs currently offered! Surely even the
    small number of queries Google offers for free
    would cover your requirement?

    (In any case, from what I understand, Google simply
    refuses further requests for the day if the limit
    is struck - no hard feelings, and back tomorrow..)

    >...There is a set number of queries, and a set number
    > of results returned. As it's an educational project I never thought of
    > the legal side!


    Don't forget that there can be a few 'legalities' to the
    educational side of things. Be careful of tripping over
    using someone else's code without proper attribution
    or accreditation.. Plagiarism/academic misconduct.
    There was a classic thread on these groups from
    a chap by the name of RoboCop - he got to find
    out the hard way.

    Andrew T.
    Andrew Thompson, Jan 4, 2007
    #12
  13. nowwho Guest

    Andrew Thompson wrote:
    > I am not sure I would use that term for it.


    Fair enough, computers and technology aren't my main interest of study.


    > I think you might be well placed to use the 'legal
    > and free' APIs currently offered! Surely even the
    > small number of queries Google offers for free
    > would cover your requirement?


    More than likely, but I would still require advice on how to incorporate
    these into a Java program.


    > Don't forget that there can be a few 'legalities' to the
    > educational side of things. Be careful of tripping over
    > using someone else's code without proper attribution
    > or accreditation.. Plagiarism/academic misconduct.
    > There was a classic thread on these groups from
    > a chap by the name of RoboCop - he got to find
    > out the hard way.


    The use of other people's code is allowed; however, ALL work and ALL
    sources of information used in any way for the project have to
    be detailed - we were well warned about the consequences of plagiarism.
    All websites accessed for the project, along with any copyright date,
    must be included, along with the date that the website was accessed,
    etc...
    nowwho, Jan 4, 2007
    #13
  14. NoNickName Guest

    Andrew Thompson wrote:
    > ..


    > BTW - nice to see you 'about the place' again..


    Thanks. Been busy with end of year deadlines recently. Should be around
    a bit more often now though.

    --
    TechBookReport Java - http://www.techbookreport.com/JavaIndex.html
    NoNickName, Jan 5, 2007
    #14
  15. nowwho wrote:
    > Hey,
    > Thanks for the information so far. I didn't realise there was so much
    > legal stuff involved; it's for a once-off educational project. Didn't
    > think it would amount to spamming. The program would only be run about
    > 50 times in total. There is a set number of queries, and a set number
    > of results returned. As it's an educational project I never thought of
    > the legal side!


    It's not spamming -- I don't know what the other guy was smoking when he
    wrote the post you're replying to. There is NO DIFFERENCE discernible to
    Google if you

    a) do 10 searches during the day by typing in a Firefox window while
    doing research or
    b) have your computer do the searches with less/no typing on your part

    Google is being "ripped off" iff you do something like:

    a) use huge amounts of their bandwidth -- well in excess of a normal
    user doing a bit of heavy research, say, generating large numbers of
    searches or delving very deeply into the result set. Fetching 10
    first-pages-of-results, one for each of 10 queries, whether done by one
    mouse click or ten typed-in queries, has little impact on them, and of
    course the one-mouse-click case makes it actually 10 queries instead of
    11 because you mistyped one and had to do it again :)
    or b) use google search results to populate your own rival "search
    engine" site with revenue-generating ads or what-have-you, either by
    scraping google's database or by just putting up a page with a script
    that takes peoples' queries and passes them to google, then takes the
    result page and replaces google's sponsored links with umpteen flashing
    banner ads. Then you're using google's work output to actually compete
    against google, rather than simply using google for research. That makes
    a crucial difference.

    Using code to drive Google lightly and for personal/educational/research
    reasons rather than commercial ones doesn't seem to be evil to me,
    especially if they cannot in practice distinguish it from "normal" use
    anyway, as it isn't producing excessive traffic or being used to compete
    against google in some way.

    In fact, where do you draw the line? Firefox with manually-typed queries
    is OK. Then we have Firefox with an MRU for queries; Firefox with query
    guessing or autocompletion based on your current activities and
    interests; Firefox with a plugin to take the result set too and
    transform it e.g. to show 50 rather than 10 hits or to weed out
    "supplemental results" that are usually MFA sites that really ARE
    ripping off google; Firefox with a plugin to run the query of your
    choice and bookmark the results every few days; ... Firefox with a
    plugin to gradually build up a database of hits for various queries by
    occasionally fetching the nth page of results for one of them, but you
    don't publish these anywhere, just use them personally ...

    I think the two things that mark a transition to being evil are causing
    them excessive traffic and competing with them using their own data in
    some way. (Also generating content-free MFA pages to generate revenue
    via AdSense ads and SEOing them, but that's more using AdSense than
    using the search engine proper, though the SEO will impact the latter
    and pollute the results.)

    I don't see any way to derive some kind of moral law that makes typing
    something morally superior to doing it with one click, and scheduling
    an automatic (infrequent) job or whatever actually sinful.
    There's no inherent virtue in inefficiency, and computers exist to
    enable automating tasks. Hyperlinks automate looking up and finding that
    dusty reference or whatever; librarians may complain that they rot young
    brains but the actual upshot is a gain in productivity, rather than some
    kind of evil decadence setting in.
    John Ersatznom, Jan 5, 2007
    #15
  16. Chris Uppal wrote:
    > And they do actively work to prevent abuse. There are many kinds of possible
    > abuse, and I imagine Google work to prevent most of them, but I doubt if there
    > are many things they dislike more than people attempting to steal their data.


    All of this depends on what constitutes "stealing" their data. Copying
    it and publishing it? Sort of -- it's some kind of infringement but not
    really "theft".

    Merely doing with one mouse click or zero what you'd do anyway with
    twenty keypresses? I don't see how the amount of clacking emanating from
    someone's workstation at location A is in any way relevant to Google as
    long as a) a single user isn't suddenly hogging their resources and b)
    the user is using the results "normally" rather than to compete with
    Google or whatever.

    The red flags that would make them look into their logfiles would be a)
    excessive bandwidth use and b) a Google clone or whatever springing up
    all of a sudden and competing for their revenue streams.

    Personal use of the search results isn't anything they can fault. Nor
    is however a person chooses to generate the requests (so long as they
    aren't excessively frequent), or however they choose to filter and use
    the results, so long as they don't use them commercially.

    I see no logical reason for them to care whether the 3 requests a given
    IP gave them in a given day came from 30 typed characters and 3 mouse
    clicks, 3 mouse clicks, or 0 mouse clicks at the requesting end, as long
    as they don't consider 3 requests in one day from one source to be
    excessive and as long as they aren't using those results in a way that
    competes somehow with Google.

    Unless, of course, the real intent is to enforce terms that let them use
    a business model based on charging ordinary users a premium merely to
    avoid tedium. I hope that isn't their intent; it would violate their
    famous motto. A tiered "typed queries are free, bookmarked are a dime
    each, and cron jobs require a monthly $59.99 subscription fee and
    special account" service where it actually costs them exactly the same
    amount (next to nil) to provide for all three use cases seems not merely
    silly, but tantamount to fraudulent. A tiered "more than xx queries a
    day requires a premium $10/month account" thing with xx in the dozens or
    hundreds might not be considered evil -- after all, generating that many
    queries actually scales up the amount serving you is costing them per
    day. And of course disallowing commercial use of the results (other than
    incidental use, like researching a purchase or a new hire -- meaning
    selling the results themselves in some manner) without a licensing arrangement
    where Google gets a percentage. That's only fair.
    John Ersatznom, Jan 5, 2007
    #16
  17. nowwho Guest

    John Ersatznom wrote:
    > nowwho wrote:
    > > Hey,
    > > Thanks for the information so far. I didn't realise there was so much
    > > legal stuff involved; it's for a once-off educational project. Didn't
    > > think it would amount to spamming. The program would only be run about
    > > 50 times in total. There is a set number of queries, and a set number
    > > of results returned. As it's an educational project I never thought of
    > > the legal side!

    >
    > It's not spamming -- I don't know what the other guy was smoking when he
    > wrote the post you're replying to. There is NO DIFFERENCE discernible to
    > Google if you
    >
    > a) do 10 searches during the day by typing in a Firefox window while
    > doing research or
    > b) have your computer do the searches with less/no typing on your part
    >
    > Google is being "ripped off" iff you do something like:
    >
    > a) use huge amounts of their bandwidth -- well in excess of a normal
    > user doing a bit of heavy research, say, generating large numbers of
    > searches or delving very deeply into the result set. Fetching 10
    > first-pages-of-results, one for each of 10 queries, whether done by one
    > mouse click or ten typed-in queries, has little impact on them, and of
    > course the one-mouse-click case makes it actually 10 queries instead of
    > 11 because you mistyped one and had to do it again :)
    > or b) use google search results to populate your own rival "search
    > engine" site with revenue-generating ads or what-have-you, either by
    > scraping google's database or by just putting up a page with a script
    > that takes peoples' queries and passes them to google, then takes the
    > result page and replaces google's sponsored links with umpteen flashing
    > banner ads. Then you're using google's work output to actually compete
    > against google, rather than simply using google for research. That makes
    > a crucial difference.


    The point of the exercise is to get the returned URLs into an offline
    database. It's an exercise purely to pull back the URLs from the
    different search engines.

    > Using code to drive Google lightly and for personal/educational/research
    > reasons rather than commercial ones doesn't seem to be evil to me,
    > especially if they cannot in practice distinguish it from "normal" use
    > anyway, as it isn't producing excessive traffic or being used to compete
    > against google in some way.


    I don't think it's a question of good or evil; I think people are
    worried that the code could be used for commercial reasons.

    > In fact, where do you draw the line? Firefox with manually-typed queries
    > is OK. Then we have Firefox with an MRU for queries; Firefox with query
    > guessing or autocompletion based on your current activities and
    > interests; Firefox with a plugin to take the result set too and
    > transform it e.g. to show 50 rather than 10 hits or to weed out
    > "supplemental results" that are usually MFA sites that really ARE
    > ripping off google; Firefox with a plugin to run the query of your
    > choice and bookmark the results every few days; ... Firefox with a
    > plugin to gradually build up a database of hits for various queries by
    > occasionally fetching the nth page of results for one of them, but you
    > don't publish these anywhere, just use them personally ...
    >
    > I think the two things that mark a transition to being evil are causing
    > them excessive traffic and competing with them using their own data in
    > some way. (Also generating content-free MFA pages to generate revenue
    > via AdSense ads and SEOing them, but that's more using AdSense than
    > using the search engine proper, though the SEO will impact the latter
    > and pollute the results.)


    This is an educational project, and as computers are not my main area
    of study I don't know what MFA and SEO are. Can this be explained?


    > I don't see any way to derive some kind of moral law that makes typing
    > something morally superior to doing it with one click, and scheduling
    > an automatic (infrequent) job or whatever actually sinful.
    > There's no inherent virtue in inefficiency, and computers exist to
    > enable automating tasks. Hyperlinks automate looking up and finding that
    > dusty reference or whatever; librarians may complain that they rot young
    > brains but the actual upshot is a gain in productivity, rather than some
    > kind of evil decadence setting in.


    Any help with using the Google API or other suggestions would be a
    great help. I also assume that Google's API won't work with the other
    search engines, so would I have to write a different class for each
    search engine?
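
    Something like this rough sketch is what I imagine - all the names
    here are made up by me, not from any real API:

        import java.util.List;

        // One class per engine behind a shared interface.
        public interface SearchEngine {
            String getName(); // e.g. "Google"
            List<String> search(String query, int maxResults) throws Exception;
        }

        // e.g. public class GoogleEngine implements SearchEngine { ... }
        //      public class YahooEngine  implements SearchEngine { ... }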
    nowwho, Jan 5, 2007
    #17
  18. nowwho wrote:
    >>I think you might be well placed to use the 'legal
    >>and free' APIs currently offered! Surely even the
    >>small number of queries Google offers for free
    >>would cover your requirement?

    >
    > The use of other people's code is allowed; however, ALL work and ALL
    > sources of information used in any way for the project have to
    > be detailed - we were well warned about the consequences of plagiarism.
    > All websites accessed for the project, along with any copyright date,
    > must be included, along with the date that the website was accessed,
    > etc...


    Oh what a tangled web we weave... what happened to the days when you
    could just tinker and innovate without fear of lawyers or similar? Hmm?
    Of course, wholesale copying of other people's stuff without permission
    and misattributing it as your own original work is simply bad, but that's
    because it's fraud and misrepresentation, not because it's copying, IMO.
    Wheel-reinventing is supposed to be a bad thing. Let some attorneys get
    involved and soon everyone is expecting you to get their permission to
    copy anything. Then to *use* anything. Then to breathe or take a leak,
    no doubt.

    I think it's worth pointing out that unless you've signed something in
    writing, you aren't in a binding agreement with Google (or anyone else)
    about anything, and only copyright, trademark, and patent law has any
    true legal force, no matter what TOC boilerplate is on whose website.
    Hell, they can't even prove that you *read* it, in any meaningful way,
    even if your IP retrieved the page one day.

    Of course the de facto law in the US isn't so rosy, thanks to a braindead
    court system and a legislature that's long since been ritually auctioned
    with great fanfare biannually to the highest bidder. I'd suggest a saner
    country. Many in Europe and, I think, even Canada actually still have
    sane legal systems, standards for when someone's actually entered into a
    binding contract, standards of evidence to get subpoenas, warrants, and
    judgments, and whatnot. Australia's as bad as the US or worse though. I
    wonder how long it is before individuals have to jurisdiction-shop by
    travel agent and $500 one-way airfare express just to do ordinary
    victimless activities without legal repercussions and $50,000 in bogus
    fines for phantom file sharing someone else on the neighborhood's cable
    company internet service may or may not actually have done...
    John Ersatznom, Jan 6, 2007
    #18
  19. nowwho Guest

    John Ersatznom wrote:
    > Oh what a tangled web we weave... what happened to the days when you
    > could just tinker and innovate without fear of lawyers or similar? Hmm?
    ....


    While the legal information is handy and can (and more than likely will)
    be included in the report, are there any suggestions on how to tackle the
    coding of the problem, or as to where I can look for further
    information?
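
    To restate the plan in code form, this is the outline I am working
    towards - an untested sketch building on the SearchEngine interface
    I sketched earlier in the thread:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.util.List;

        public class SearchRunner {
            // One query per line; run each query past every engine and
            // record each URL with its engine name, query and position.
            public static void run(List<SearchEngine> engines, String queryFile)
                    throws Exception {
                BufferedReader reader =
                        new BufferedReader(new FileReader(queryFile));
                String query;
                while ((query = reader.readLine()) != null) {
                    for (SearchEngine engine : engines) {
                        List<String> urls = engine.search(query, 100);
                        for (int i = 0; i < urls.size(); i++) {
                            // store(engine.getName(), query, urls.get(i), i + 1)
                            // - i.e. the JDBC insert sketched near the top
                            //   of the thread.
                        }
                    }
                }
                reader.close();
            }
        }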
    nowwho, Jan 6, 2007
    #19
  20. Chris Uppal Guest

    John Ersatznom wrote:

    > > The use of other people's code is allowed; however, ALL work and ALL
    > > sources of information used in any way for the project have to
    > > be detailed - we were well warned about the consequences of plagiarism.
    > > All websites accessed for the project, along with any copyright date,
    > > must be included, along with the date that the website was accessed,
    > > etc...

    >
    > Oh what a tangled web we weave...what happened to the days when you
    > could just tinker and innovate without fear of lawyers or similar?


    I think the OP's problem here is not so much the legality (or otherwise) of
    "borrowing" Google's data, but that this is work in an academic context where
    all sources /must/ be declared for reasons of honesty in scholarship.

    -- chris
    Chris Uppal, Jan 6, 2007
    #20
