Automating Searches

Chris Uppal · Jan 6, 2007

nowwho said:
While the legal information is handy and can (more than likely will) be
included in the report, are there any suggestions on how to tackle the
coding of the problem or suggestions as to where I can look for further
information?

Unfortunately, it appears that Google suspended their Search API last month
(http://code.google.com/apis/soapsearch/), so you will probably have to use
some sort of screen scraping.

If you want to do it in Java (rather than, say, by using command-line tools
such as wget or curl) then you'll need an HTTP client package. Java comes with
one (start with java.net.URL), but it has been said here that Google blocks
access via that, so you may be better off using a different, and more general,
package such as the Jakarta HTTP client
http://jakarta.apache.org/commons/httpclient/
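
A minimal sketch of the download step, for illustration only. To stay self-contained it uses the JDK's own java.net classes rather than the Jakarta client, and sets an explicit User-Agent header. Whether that is enough to avoid the blocking mentioned above is an untested assumption. The class name, the constants (including the search URL shape and the User-Agent string) and the method names are all made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchFetcher {
    // Hypothetical values, chosen for illustration only.
    static final String SEARCH_BASE = "http://www.google.com/search?q=";
    static final String USER_AGENT = "ExampleReportBot/0.1";

    /** Builds a results-page URL, form-encoding the search terms. */
    static String buildSearchUrl(String query) {
        try {
            return SEARCH_BASE + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Downloads the body of the given address as text (assumes UTF-8). */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

Typical use would be something like SearchFetcher.fetch(SearchFetcher.buildSearchUrl("java html parser")). The same two steps (build the URL, issue a GET with chosen headers) carry over to any other HTTP client package.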

Then, once you have worked out how to download data, you will need to parse it
to find the links you want. Parsing HTML with anything like reliability is not
easy (but you may not need much reliability in this case); you may find this
page of HTML parsers useful.
http://www.java-source.net/open-source/html-parsers
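
If low reliability really is acceptable, a crude regular-expression scan can stand in for a parser while experimenting. The sketch below is not one of the parsers listed on that page and is not a real HTML parser: it only pulls quoted href values out of <a> tags, and it will be fooled by unquoted attributes, comments, scripts and similar cases. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the quoted href values of all <a> tags, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

The returned strings are whatever the page contains, so relative links and redirect-style URLs still need resolving or filtering before use.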

-- chris

Chris Uppal · Jan 6, 2007

John Ersatznom wrote:

[me:]

All of this depends on what constitutes "stealing" their data. Copying
it and publishing it? Sort of -- it's some kind of infringement but not
really "theft".

I don't particularly want to focus on what word(s) best fit the malefaction.
I'll stick with the general purpose "abuse" (which doesn't necessarily even
imply illegality).

Merely doing with one mouse click or zero what you'd do anyway with
twenty keypresses? I don't see how the amount of clacking emanating from
someone's workstation at location A is in any way relevant to Google as
long as a) a single user isn't suddenly hogging their resources and b)
the user is using the results "normally" rather than to compete with
Google or whatever.

Here you are mentioning only one aspect of the abuse (as it might appear to
Google) -- namely overuse of their resources. And I doubt if they are too
worried about that (within reason, of course). But almost /any/ automated
scanning of their database is an abuse in another sense: they make that data
available to people (not machines) in order to make money off it. Their (only,
as far as I know) source of cash is directly or indirectly from the advertising
they include with the search results. If you don't see the advertising then
you are using their resources and data without paying for them. How could they
/not/ want to minimise that?

The red flags that would make them look into their logfiles would be a)
excessive bandwidth use and b) a Google clone or whatever springing up
all of a sudden and competing for their revenue streams.

Or anything else that suggests that the search results are not being read by a
human...

Of course, they own the servers, they pay the (probably massive) network costs
and other data-centre costs, so it's up to them what they consider "fair". If
they choose to object to people called "Chris" using their services, then
that's up to them -- I have no real right to complain -- they can be as
arbitrary as they like. Naturally, since they want to make money, they can't
be too very arbitrary (and aren't), but by the same token, they do have good
reasons to (try to) protect their services from freeloaders.

-- chris

Lew · Jan 7, 2007

Chris said:
Of course, they own the servers, they pay the (probably massive) network costs
and other data-centre costs, so it's up to them what they consider "fair". If
they choose to object to people called "Chris" using their services, then
that's up to them -- I have no real right to complain -- they can be as
arbitrary as they like. Naturally, since they want to make money, they can't
be too very arbitrary (and aren't), but by the same token, they do have good
reasons to (try to) protect their services from freeloaders.

I am not sure if name-bigotry is covered, but in many countries discrimination
in the provision of goods or services for certain factors like race, religion,
national origin, physical or mental disabilities and some other like
attributes is illegal. The legal principle rests in part on whether a trait is
innate, like national origin, or voluntary, like whether to wear a beard (for
most). This in no wise invalidates points others have made in this thread
except to point out that legal niceties punch exceptions into many broad
generalizations about these topics.

The legal question of data ownership carries many perilous implications. Does
Google own the information, or merely its representation? Is that
representation limited to its appearance on the screen, or does its specific
storage in their databases qualify? What about the source whence came Google's
data - when they scraped information off foo.com to include it in their data,
did they violate foo.com's owner's intellectual property rights? If I scraped
foo.com and came up with similar information to Google's in a similar data
structure (because data structures are "obvious" to a competent software
engineer), have I violated any of Google's IP rights?

Larger jurisprudential question: what degree of data openness or private
ownership best benefits society?

Concomitant question: what constitutes fair use of another's data?

- Lew

Andrew Thompson · Jan 7, 2007

Lew said:
...What about the source whence came Google's
data - when they scraped information off foo.com to include it in their data,
did they violate foo.com's owner's intellectual property rights?

I assume they figure that complying with a 'robots.txt'*
gives them some justification that they were 'invited'
(or at the very least, not excluded or banned) from
the site in question.

* <http://www.robotstxt.org/>
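
For anyone writing their own bot, checking a path against a site's robots.txt might look roughly like the sketch below. It is deliberately simplified: it understands only User-agent and Disallow lines, does plain prefix matching, ignores Allow lines and wildcards, and merges the "*" record with any record naming the bot (which is stricter than a full implementation would be). The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsCheck {
    /** Collects the Disallow prefixes that apply to the agent or to "*". */
    static List<String> disallowedPrefixes(String robotsTxt, String agent) {
        List<String> prefixes = new ArrayList<>();
        boolean applies = false;      // does the current record cover us?
        boolean inAgentLines = false; // inside a run of User-agent lines?
        String wanted = agent.toLowerCase(Locale.ROOT);
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or unrecognised line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    applies = false; // a new record starts here
                }
                inAgentLines = true;
                String name = value.toLowerCase(Locale.ROOT);
                if (name.equals("*")
                        || (!name.isEmpty() && wanted.contains(name))) {
                    applies = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    prefixes.add(value);
                }
            }
        }
        return prefixes;
    }

    /** True if no applicable Disallow prefix matches the start of the path. */
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        for (String prefix : disallowedPrefixes(robotsTxt, agent)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The robots.txt text itself would be downloaded once per site and passed in as a string, together with the path of each URL the bot is about to request.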

Andrew T.

Andrew Thompson · Jan 7, 2007

Andrew said:
I assume they figure that complying with a 'robots.txt'* ...

E.G. <http://www.google.com/robots.txt>

Andrew T.

John Ersatznom · Jan 8, 2007

Chris said:
Here you are mentioning only one aspect of the abuse (as it might appear to
Google) -- namely overuse of their resources. And I doubt if they are too
worried about that (within reason, of course). But almost /any/ automated
scanning of their database is an abuse in another sense: they make that data
available to people (not machines) in order to make money off it. Their (only,
as far as I know) source of cash is directly or indirectly from the advertising
they include with the search results. If you don't see the advertising then
you are using their resources and data without paying for them. How could they
/not/ want to minimise that?

If accessing a site in such a way as to not see advertising is "wrong",
then using adblock plugins for your browser must be wrong. Using
Ad-Aware to wipe out those foo.doubleclick.com tracking cookies must be
wrong. Putting "127.0.0.1 foo.doubleclick.com" in your hosts file must be
wrong. Hell, walking into the kitchen to fix yourself a snack when your
TV show goes to an ad must be wrong! Maybe even avoiding spam or
deleting it unread...

There is such a thing as taking something too far.

Of course, they own the servers, they pay the (probably massive) network costs
and other data-centre costs, so it's up to them what they consider "fair". If
they choose to object to people called "Chris" using their services, then
that's up to them -- I have no real right to complain -- they can be as
arbitrary as they like. Naturally, since they want to make money, they can't
be too very arbitrary (and aren't), but by the same token, they do have good
reasons to (try to) protect their services from freeloaders.

That's completely aside from any legal issues, and down to any business being
able to pick its customers selectively. And, of course, their ability to
do so is limited to the extent that they can detect whatever they don't
like. If they don't like people named "Chris" a Chris can use a phony
name and they won't know the difference unless they start demanding ID
verification to grant access, and they won't do that because it would be
a quick way to self-destruct in the search-engine business.

Automating some of your search usage is similarly something you can fly
below their radar, but in doing so you will clearly have to avoid any
high levels of usage that would bother them and get their attention. But
below that threshold, it's also a case of "what they don't know can't
hurt them"...
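
Keeping usage low mostly comes down to spacing requests out. A minimal sketch of such a throttle follows, assuming a fixed minimum gap between requests. What gap a given site would tolerate is not something this thread establishes, and the class and method names are invented for the example.

```java
final class PoliteThrottle {
    private final long minGapMillis;
    private boolean used = false; // has any request been recorded yet?
    private long lastRequestAt;

    PoliteThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** How long a caller should still wait, as of nowMillis. */
    long millisToWait(long nowMillis) {
        if (!used) {
            return 0; // the first request never waits
        }
        long elapsed = nowMillis - lastRequestAt;
        return Math.max(0, minGapMillis - elapsed);
    }

    /** Records that a request was sent at nowMillis. */
    void markRequest(long nowMillis) {
        used = true;
        lastRequestAt = nowMillis;
    }

    /** Sleeps until the gap has passed, then records the request. */
    void acquire() throws InterruptedException {
        long wait = millisToWait(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
        markRequest(System.currentTimeMillis());
    }
}
```

A bot would call acquire() immediately before each download. The clock is passed in explicitly to millisToWait and markRequest so the spacing logic can be checked without actually sleeping.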

John Ersatznom · Jan 8, 2007

Lew said:
Larger jurisprudential question: what degree of data openness or private
ownership best benefits society?

Complete openness, except for national security matters, and those have
to be things like non-stale battle plans that are of use to the enemy if
they get it in a timely fashion. Any other security-based secrecy is
security-through-obscurity; prefer a massive, well-understood defense to
one that depends on the enemy being totally incompetent at espionage.

So-called "intellectual property" may be the single biggest
legal/judicial mistake in history -- far from promoting innovation, all
it seems to do is promote monopolies and lock-in. Check out
againstmonopoly.org sometime. Bad patents are a recurring theme there
and at techdirt, slashdot and other tech sites, but they're just the tip
of the iceberg.

Concomitant question: what constitutes fair use of another's data?

Any private, educational, or nonprofit use should IMO. Of course if I
had my druthers any use at all would. The only things "protectable"
would be personal information, which people would be able to insist
(with legal clout) companies like ChoicePoint delete or at least verify.
And, eventually, the person's actual mind itself, once the technology to
download or otherwise access it with the right tools is available. If I
don't want spammers pestering me at some email address I think I have
that right, but if I publish something nonpersonal by choice I don't
feel I should then try to dictate how others use it.

John Ersatznom · Jan 8, 2007

Andrew said:
E.G. <http://www.google.com/robots.txt>

Unfortunately, one de facto effect of this protocol is that a lot of
sites configure it to deny any automated access and then carve out a few
narrow exemptions for Google and a handful of other big names in search,
on the grounds that nobody else actually drives traffic and business to
their site in any real quantity. The logical outcome is to shut out
smaller search engines and private web-use automation, however. The
former means the current crop of big-name search engines now have a lock
on the market. The latter is simply dumb, since letting people automate
aspects of their web use makes the web (and your site) more useful to them.

Some potentially useful web services are especially likely to be badly
affected. Price comparators, for one. If you run an ecommerce site with
nine competitors, and they all let a price comparator site's bot have
access, and you do likewise, then 90% of the time it will forward people
to a competitor. Obviously as an ecommerce vendor you want to block
price comparator bots! Unfortunately, this is not beneficial to society,
since you are outnumbered by your market, and your market is harmed by
stifling access to information, and the additional ENTIRE market of
online price comparison is threatened if everyone behaves the same.

So there are strong incentives to ignore robots.txt directives for
search engine startups, price comparison engines and suchlike, and
personal automation. Of course, accessing the file but then ignoring a
directive in it is detectable by the site admin who will block your IP,
and the ability to change IPs readily is much more available to the
bigger sites that don't need it than to the smaller sites and
individuals, so that means small-time bots have to not even access it
(and have to fly under the radar -- not too much bandwidth and "look
human").

The good side is that robots.txt does force non-bigname bots to run very
quietly and not use much bandwidth at all or otherwise call attention to
themselves, which serves part of the purpose anyway (one function of
robot directives is to help site admins prevent overuse of their bandwidth).
